FIELD OF THE INVENTION

This invention relates to a method for constructing 
predictive models from data that is particularly well-suited for
modeling insurance risks or policyholder profitability, based on 
historical policy and claims data. 

BACKGROUND OF THE INVENTION 

Our work considers a widely applicable method of 
constructing segmentation-based predictive models from data that 
permits limits to be placed on the statistical estimation errors 
that can be tolerated with respect to various aspects of the 
models that are constructed. In this regard, we have discerned 
that the ability to limit estimation errors during model 
construction can be quite valuable in industries that use 
predictive models to help make financial decisions. In 
particular, we have discerned that this ability is of critical 
importance to the insurance industry. 

Insurers develop price structures for insurance policies 
based on actuarial risk models. These models predict the expected 
claims that will be filed by policyholders as a function of the 
policyholders' assessed levels of risk. A traditional method used 
by actuaries to construct risk models involves first segmenting 



Y0999-214 



an overall population of policyholders into a collection of risk 
groups based on a set of factors, such as age, gender, driving 
distance to place of employment, etc. The risk parameters of each 
group (i.e., segment) are then estimated from historical policy 
and claims data.

Ideally, the resulting risk groups should be homogeneous 
with respect to risk; that is, further subdividing the risk 
groups by introducing additional factors should yield 
substantially the same risk parameters. In addition, the risk
groups should be actuarially credible; that is, the statistical
errors in the estimates of the risk parameters of each group
should be sufficiently small so that fair and accurate premiums
can be charged to the members of each risk group.

However, identifying homogeneous risk groups that are also
actuarially credible is not a simple matter. Actuaries typically
employ a combination of intuition, guesswork, and trial-and-error
hypothesis testing to identify suitable risk factors. For each
combination of risk factors that is explored, actuaries must
estimate both the risk parameters of the resulting risk groups and
the actuarial credibility of those parameter estimates.
The human effort involved is often quite high, and good risk
models can take several years to develop and refine.

SUMMARY OF THE INVENTION 

Our method overcomes the limitations inherent in manual 
methods for constructing segmentation-based predictive models by 
combining automated search over possible segmentations with
constraints on the statistical estimation errors that can be 
tolerated in the predictive models that are constructed for each
segment. In the case of insurance risk modeling, the segments 
would correspond to risk groups and the constraints would 
correspond to criteria used by actuaries to assess actuarial 
credibility. 

The benefit of our method, from the point of view of 
insurance risk modeling, is that automation enables potential 
risk factors to be analyzed in greater detail. Consequently, our 
method can be far more effective at identifying relevant risk 
factors than traditional methods employed by actuaries, to the 
point where new risk factors are identified that were previously 
unrecognized. Moreover, assessments of actuarial credibility are 
made throughout the process in order to ensure the credibility of 
the resulting risk groups. By constraining the automated search 
to only produce actuarially credible risk groups, our method 
enables highly predictive risk models to be developed in a matter 
of weeks or days.

In addition to its use in constructing risk models, our 
methodology can also be used to construct profitability models. 
Whereas risk models segment policyholders into homogeneous groups 
according to their levels of risk, profitability models segment 
policyholders into homogeneous groups according to their loss 
ratios (i.e., the ratio of claims paid over premiums collected). 
Differences in profitability among the segments that are 
identified would enable actuaries to diagnose ailing insurance 
products to determine why they are unprofitable. Moreover, the
risk factors that define each segment, combined with the 
profitability predictions of each segment, would provide concrete 
indications of how to adjust a product's existing price structure
to make it profitable. 

To this end, we have discerned that deficiencies exist in 
the prior art on segmentation-based modeling methods with regard 
to the kinds of constraints that can be placed on the statistical 
estimation errors that can be tolerated with respect to various 
aspects of the models that are constructed for each segment. 
These deficiencies are especially evident in the context of 
insurance risk modeling. 

For example, Gallagher (C. Gallagher, "Risk classification 
aided by new software tool," National Underwriter Property &
Casualty-Risk & Benefits Management, No. 17, p. 19, April 27,
1992) discusses how the CHAID classification tree algorithm 
(G. V. Kass, "An exploratory technique for investigating large 
quantities of categorical data," Applied Statistics, Vol. 29, 
No. 2, pp. 119-127, 1980; and D. Biggs, B. de Ville, and E. Suen,
"A method of choosing multiway partitions for classification and 
decision trees," Journal of Applied Statistics, Vol. 18, No. 1, 
pp. 49-62, 1991) can be used to segment populations of automobile 
insurance policyholders into high-risk and low-risk groups based 
on accident frequency. Gallagher's methodology is not restricted 
to CHAID and can also be applied in conjunction with other 
tree-based modeling packages, such as CART (L. Breiman, 
J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and 
Regression Trees, New York: Chapman & Hall, 1984), C4.5 
(J. R. Quinlan, C4.5: Programs for Machine Learning, San Mateo, 
CA: Morgan Kaufmann, 1993), SPRINT (J. C. Shafer, R. Agrawal, and 
M. Mehta, "SPRINT: a scalable parallel classifier for data 
mining," Proceedings of the 22nd International Conference on Very 
Large Databases, Bombay, India, September 1996), and QUEST
(W.-Y. Loh and Y.-S. Shih, "Split selection methods for
classification trees," Statistica Sinica, Vol. 7, pp. 815-840,
1997). A deficiency common to all of the above tree-based
methods, however, is that the methods are not designed to take 
actuarial credibility into consideration; consequently, the risk 
groups that are identified by these tools are not guaranteed to 
be actuarially credible. 

As previously discussed, actuarial credibility has to do 
with the statistical accuracy of estimated risk parameters. 
Actuaries not only want risk models that have high predictive 
accuracy in terms of distinguishing high risks from low risks, 
they also want accurate statistical estimates of the risk 
parameters so that price structures can be derived from the risk 
models that are both fair and accurate. The calculations needed 
to assess actuarial credibility are specific to the statistical 
models used by actuaries to model insurance risk (see, for 
example, S. A. Klugman, H. H. Panjer, and G. E. Willmot, Loss 
Models: From Data to Decisions, New York: John Wiley & Sons,
1998). Tree-based methods in the prior art are simply not
equipped to perform these calculations because they are not 
specifically designed for insurance risk modeling purposes. 

The above deficiency can be demonstrated by using property 
and casualty insurance as an example. For this type of insurance, 
a proposed risk group is said to be fully credible if the number 
of historical claims filed by that group as calculated from the 
training data is greater than or equal to a threshold whose value 
is determined from the average claim amount for the group, as 
calculated from the training data, and from the standard 
deviation of those claim amounts, also as calculated from the
training data. In short, the minimum number of claims needed to
achieve full credibility is a function of the statistical
characteristics of those claims. Because these statistical
characteristics can vary from one proposed risk group to another,
the minimum number of claims needed to achieve full credibility 
can likewise vary from one proposed risk group to another. 
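
By way of illustration only, the dependence of the threshold on the statistics of each proposed risk group can be sketched in Python. The sketch assumes the classical limited-fluctuation standard for full credibility with Poisson claim counts, which is a textbook formula and not necessarily the calculation employed by the invention. The function names, the 90% probability level, and the 5% tolerance are hypothetical choices made for the example.

```python
from statistics import NormalDist, mean, pstdev


def full_credibility_threshold(claim_amounts, prob=0.90, tolerance=0.05):
    """Minimum number of claims for full credibility of one risk group.

    Assumption: classical limited-fluctuation standard with Poisson
    claim counts, n >= (z / tolerance)^2 * (1 + (std / mean)^2), where
    z is the two-sided standard normal quantile for `prob`.
    """
    z = NormalDist().inv_cdf((1.0 + prob) / 2.0)
    avg = mean(claim_amounts)      # average claim amount of the group
    std = pstdev(claim_amounts)    # standard deviation of claim amounts
    return (z / tolerance) ** 2 * (1.0 + (std / avg) ** 2)


def is_fully_credible(claim_amounts, prob=0.90, tolerance=0.05):
    """True if the group's claim count meets its own threshold."""
    threshold = full_credibility_threshold(claim_amounts, prob, tolerance)
    return len(claim_amounts) >= threshold
```

Under these assumptions, a group whose claim amounts are more variable relative to their mean requires more claims before it is fully credible, so no single global record count can stand in for the test.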

Prior art tree-based methods are able to impose constraints 
on the minimum number of records in the training data per segment
(i.e., per leaf node); however, these constraints are global in
nature in that the thresholds are constant across all segments.
For example, global thresholds can be placed either on the
minimum number of training records per segment (see, for example,
the stopping rules discussed in the SPSS white paper entitled
"AnswerTree algorithm summary," available from SPSS, Inc. at
http://www.spss.com/cool/papers/algo_sum.htm), or on the minimum
number of training records per segment for each species of data
record (e.g., claim versus nonclaim records). Examples of the
latter include the stopping rules described by Loh and
Vanichsetakul (W.-Y. Loh and N. Vanichsetakul, "Tree-structured
classification via generalized discriminant analysis," Journal of
the American Statistical Association, Vol. 83, pp. 715-728, 1988)
and the "fraction of objects" rule implemented in the STATISTICA
package available from StatSoft, Inc. (see, for example, the
StatSoft documentation available at
http://www.statsoft.com/textbook/stclatre.html). Again, in both
cases, the thresholds are constant across all segments
constructed by the tree-based methods, whereas constraints on
actuarial credibility entail thresholds that vary from one
segment to another as a function of the statistical
characteristics of each segment. 

In sharp contrast, our segmentation-based modeling method is 
able to utilize complex constraints, such as actuarial
credibility, as an integral part of the model building process so
as to produce segmentations that satisfy the constraints. In
particular, when applied to insurance risk modeling, our method
ensures that the resulting risk groups will meet desired
actuarial credibility constraints.

A further deficiency of prior art methods is that, even if
the ability to apply complex statistical constraints were
incorporated into prior art methods, such as CHAID (see Kass
above, and Biggs et al. above), CART (see Breiman et al. above),
C4.5 (see Quinlan, 1993, above), SPRINT (see Shafer et al.
above), and QUEST (see W.-Y. Loh and Y.-S. Shih above), the
methods would apply the statistical constraints in an open-loop
fashion, in the sense that potential segments would first be
constructed and the statistical constraints would then be applied
to decide when to stop refining the segments. With this open-loop
approach, poor choices made during the construction of potential
segments can cause premature termination of the segment
refinement process by producing potential segments that violate
the statistical constraints, despite the fact that it may have
been possible to produce alternate segments that satisfy the
constraints.

In sharp contrast, the present invention uses statistical 
constraints in a closed-loop fashion to guide the construction of
potential segments so as to produce segments that satisfy the 
statistical constraints whenever it is feasible to do so. The
method is closed-loop in the sense that the statistical 
constraints are used in a manner that is analogous to an error 
signal in a feedback control system, wherein the error signal is 
used to regulate the inputs to the process that is being 
controlled (see, for example, J. J. Distefano, A. R. Stubberud, 
and I. J. Williams, Schaum's Outline of Theory and Problems of
Feedback and Control Systems, New York: McGraw-Hill, 1967). In 
the case of the present invention, the statistical constraints 
are repeatedly evaluated to determine whether or not they hold, 
and the results of the evaluations are used to regulate the 
construction of potential segments. This closed-loop methodology 
ensures that potential segments are constructed that satisfy the 
statistical constraints whenever it is feasible to do so. The 
methodology thereby avoids premature termination of the segment 
refinement process caused by poor choices made during the 
construction of potential segments. 
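
The difference between the two approaches can be sketched in Python under simplifying assumptions. Here a candidate split is a hypothetical dictionary holding a "score" (lower is better) and a list of "children" segments, and `satisfies_constraint` is any caller-supplied test on a segment. Filtering candidate splits by the constraint is only one simple way of using the constraint as feedback; it is offered as an illustration and is not the specific procedure of the invention.

```python
def open_loop_split(candidate_splits, satisfies_constraint):
    """Choose the best-scoring split first, then test the constraint.

    Returns None (refinement stops) when any child of the best split
    violates the constraint, even if another split would have passed.
    """
    if not candidate_splits:
        return None
    best = min(candidate_splits, key=lambda s: s["score"])
    if all(satisfies_constraint(seg) for seg in best["children"]):
        return best
    return None


def closed_loop_split(candidate_splits, satisfies_constraint):
    """Use the constraint evaluations to regulate which split is built.

    Only splits whose children all satisfy the constraint compete, so
    refinement stops only when no feasible split exists.
    """
    feasible = [
        s for s in candidate_splits
        if all(satisfies_constraint(seg) for seg in s["children"])
    ]
    return min(feasible, key=lambda s: s["score"]) if feasible else None
```

When the best-scoring split produces a segment that violates the constraint, the open-loop function returns nothing and refinement terminates, whereas the closed-loop function falls back to the best split that remains feasible.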

In addition to deficiencies with respect to the kinds of 
statistical constraints that can be imposed on segments, we have
also discerned that several other deficiencies exist in the prior 
art on segmentation-based modeling methods from the point of view 
of insurance risk modeling. As described above, Gallagher's 
methodology for identifying risk groups is based on constructing 
segmentation-based models for predicting claim frequency.
However, frequency is only one of the risk parameters relevant to 
automobile insurance. Other risk parameters include severity 
(i.e., mean claim amount), pure premium (i.e., frequency times 
severity), and loss ratio (i.e., pure premium over premium 
charged). Segmenting policyholders into risk groups based on
frequency alone, as described by Gallagher, and then estimating 
other risk characteristics after the fact may yield suboptimal
risk models because the resulting risk groups might not be
optimized for predicting the specific risk parameter(s) of
interest.

Pure premium is perhaps the most important risk 
characteristic because it represents the minimum amount that 
policyholders in a risk group must be charged in order to cover 
the claims generated by that risk group. Actual premiums charged 
are ultimately determined based on the pure premiums of each risk 
group, as well as on the cost structure of the insurance company, 
its marketing strategy, competitive factors, etc. 

If the objective is to predict pure premium, then 
Gallagher's suggestion of segmenting policyholders into risk 
groups based on frequency may be suboptimal because the resulting 
risk groups might not be optimized for predicting pure premium. 
This deficiency exists not only with respect to CHAID, but with 
respect to all other tree-based modeling packages as well, such 
as CART (see L. Breiman et al. above), C4.5 (see J. R. Quinlan, 
1993, above), SPRINT (see J. C. Shafer et al. above), and QUEST
(see W.-Y. Loh and Y.-S. Shih above).

It is possible to directly model pure premium using some 
tree-based modeling packages; however, the resulting models can 
again be suboptimal because the statistical characteristics of 
insurance data are not being taken into account. 

To directly model pure premium using standard tree-based 
techniques, one can make use of the fact well-known to actuaries 
that the pure premium of a risk group can be computed as a 
weighted average of the pure premiums of the individual
historical policy and claims data records that belong to the risk
group. The pure premium of an individual data record is equal to
the claim amount associated with that data record divided by the
record's earned exposure (i.e., the length of time represented by
that data record during which the corresponding policy was
actually in force). The weighted average is calculated by summing
the products of each individual pure premium times its
corresponding earned exposure, and then dividing the resulting
sum by the sum of the earned exposures.
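
The weighted average described above can be written as a short Python sketch. The function name and the representation of each data record as a (claim amount, earned exposure) pair are assumptions made for the example.

```python
def group_pure_premium(records):
    """Exposure-weighted average of individual pure premiums.

    `records` is an iterable of (claim_amount, earned_exposure) pairs,
    each with earned_exposure > 0.
    """
    records = list(records)
    # Sum of (individual pure premium) * (earned exposure) over records.
    weighted_sum = sum(
        (claim / exposure) * exposure for claim, exposure in records
    )
    total_exposure = sum(exposure for _, exposure in records)
    return weighted_sum / total_exposure
```

Because each product of an individual pure premium and its earned exposure reduces to the record's claim amount, the result equals the total claim amount of the group divided by its total earned exposure.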

Loss ratio (i.e., the ratio of claims paid over premiums
collected) can also be modeled directly in terms of weighted sums
of individual loss ratios. The loss ratio of an individual data
record is equal to the record's claim amount divided by its
earned premium (i.e., the premium charged per unit time for the
policy times the record's earned exposure). As is well-known to
actuaries, the loss ratio of a risk group can be calculated by
summing the products of each individual loss ratio times its
corresponding earned premium, and then dividing the resulting sum
by the sum of the earned premiums.

Several tree-based modeling packages do allow weights to be
assigned to data records and are thereby able to calculate the
weighted averages needed to directly model pure premium or loss
ratio. Examples of such packages are SPSS's implementations of
CHAID and CART (see, for example, http://www.SPSS.com). However,
these packages segment data records based on weighted sums of
squares calculations. In particular, in the case of pure premium,
risk groups would be identified using numerical criteria that
rely on calculating sums of weighted squared differences between
individual pure premiums and weighted averages of individual pure
premiums. In the case of loss ratio, the numerical criteria would
rely on calculating sums of weighted squared differences between
individual loss ratios and weighted averages of individual loss
ratios.

One of the lessons taught by work in robust estimation (see, 
for example, R. R. Wilcox, Introduction to Robust Estimation and
Hypothesis Testing, New York: Academic Press, 1997) is that such
squared difference criteria can yield suboptimal estimation
procedures for data that have highly skewed distributions. In the
case of automobile claims data, both the pure premiums and the
loss ratios of individual claim records do in fact have highly
skewed distributions. The squared difference criteria used by
standard tree-based modeling packages are therefore not
well-suited for these data. Segmenting policyholders on the basis
of these criteria may therefore yield suboptimal results relative
to the results that could be obtained if more robust criteria
were employed based on the actual statistical characteristics of
the data.

Actuaries have long recognized that claim-related data tend
to have highly skewed distributions. An important principle of
actuarial science is to employ statistical distributions whose
shapes closely match the observed distributions of the data (see,
for example, S. A. Klugman et al. above). For example, in the
case of personal lines automobile insurance, claim events are
often modeled as Poisson random processes and claim amounts are
often modeled using log-normal probability distributions. The
importance of employing statistical distributions that match
observed distributions of data is likewise demonstrated by work
in robust estimation. 

Another important distinguishing feature of the present 
invention is that the invention enables any of the statistical 
models employed by actuaries to be used as the basis for 
automatically identifying risk groups, thereby overcoming the 
deficiencies of prior art techniques that are described above. In 
particular, joint Poisson/log-normal models can be used to 
construct risk models for personal lines automobile insurance. 
This same class of models is also suitable for personal lines 
property and casualty insurance in general. 
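
As a minimal sketch of what a joint Poisson/log-normal model estimates for a single risk group, the following Python function computes standard maximum-likelihood estimates. It assumes that all claims are settled, that every claim amount is positive, and that the group has at least one claim. The function name and the returned dictionary keys are hypothetical, and the estimation procedures of the invention itself are not reproduced here.

```python
import math


def fit_poisson_lognormal(claim_amounts, total_exposure):
    """Maximum-likelihood estimates for one risk group.

    Frequency: claims per unit of earned exposure (Poisson process).
    Severity: log-normal, i.e. log(claim amount) is normally
    distributed with mean `mu` and standard deviation `sigma`.
    """
    n = len(claim_amounts)
    logs = [math.log(amount) for amount in claim_amounts]
    mu = sum(logs) / n
    sigma2 = sum((x - mu) ** 2 for x in logs) / n
    frequency = n / total_exposure
    severity = math.exp(mu + sigma2 / 2.0)  # mean of a log-normal
    return {
        "frequency": frequency,
        "mu": mu,
        "sigma": math.sqrt(sigma2),
        "severity": severity,
        "pure_premium": frequency * severity,
    }
```

The pure premium is obtained as frequency times severity, so both components of the joint model contribute to the quantity that is ultimately priced.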

A second additional aspect in which prior art tree-based 
methods are deficient is that the methods do not take into 
account the fact that some claims can take several years to 
settle, most notably bodily injury claims. Specialized estimation
procedures are generally required to estimate risk parameters in 
the presence of unsettled claims. As with actuarial credibility, 
the calculations that are needed are specific to the statistical 
models used by actuaries to model insurance risk. Once again, 
standard tree-based methods are not equipped to perform these 
calculations because they are not specifically designed for 
insurance risk modeling purposes. Our method, on the other hand, 
is able to incorporate such estimation procedures. 

As indicated above, we have discerned that the prior art 
methods for automatically constructing segmentation-based models
have numerous deficiencies that are especially evident from the 
point of view of insurance risk modeling. In sharp contrast, we 
have now discovered a methodology for constructing
segmentation-based models that overcomes these deficiencies.

In a first aspect, the present invention discloses a program 
storage device readable by a machine, tangibly embodying a 
program of instructions executable by the machine to perform 
method steps for constructing segmentation-based models that 
satisfy constraints on the statistical properties of the 
segments, the method steps comprising: 

(1) presenting a collection of training data records 
comprising examples of input values that are 
available to the model together with the 
corresponding desired output value(s) that the
model is intended to predict; 

and 

(2) generating on the basis of the training data a 
plurality of segment models, that together comprise 
an overall model, wherein each segment model is 
associated with a specific segment of the training 
data, the step of generating comprising performing 
optimization steps comprising: 

a) generating alternate training data segments 
and associated segment models; 

b) evaluating at least one generated segment 
to determine whether it satisfies at least 
one statistical constraint comprising a 
test whose outcome is not equivalent to a
comparison between, on the one hand, the 
number of training records of at least one 
species of training records belonging to 
the segment and, on the other hand, a 
numerical quantity that may depend on the 
combination of species of training records 
being considered but that is otherwise 
constant for all generated segments that 
are evaluated; 

and 

c) selecting a final plurality of segment 
models and associated segments from among 
the alternates evaluated that have 
satisfactory evaluations. 

This first aspect of the invention can realize significant 
advantages because it enables complex constraints to be placed on 
the statistical estimation errors that can be tolerated with 
respect to various aspects of the predictive models that are 
constructed. For example, in the context of insurance risk 
modeling, the invention can be used to impose actuarial 
credibility constraints on the risk groups that are constructed. 

In a second aspect, the present invention discloses a 
program storage device readable by a machine, tangibly embodying 
a program of instructions executable by the machine to perform 
method steps for constructing segmentation-based models that 
satisfy constraints on the statistical properties of the
segments, the method steps comprising: 

(1) presenting a collection of training data records 
comprising examples of input values that are 
available to the model together with the 
corresponding desired output value(s) that the
model is intended to predict; 

and 

(2) generating on the basis of the training data a 
plurality of segment models, that together comprise 
an overall model, wherein each segment model is 
associated with a specific segment of the training 
data, the step of generating comprising performing 
optimization steps comprising: 

a) generating alternate training data segments 
and associated segment models using 
statistical constraints to guide the 
construction of the data segments in a 
closed-loop fashion so as to ensure that 
the resulting data segments satisfy the 
statistical constraints; 

and 

b) selecting a final plurality of segment 
models and associated segments from among 
the alternates generated. 

In a third aspect, the present invention discloses a program
storage device readable by a machine, tangibly embodying a
program of instructions executable by the machine to perform
method steps for constructing segmentation-based models that
satisfy constraints on the statistical properties of the 
segments, the method steps comprising: 

(1) presenting a collection of training data records 
comprising examples of input values that are 
available to the model together with the 
corresponding desired output value(s) that the
model is intended to predict; 

(2) generating on the basis of the training data a 
plurality of segment models, that together comprise 
an overall model, wherein each segment model is 
associated with a specific segment of the training 
data, the step of generating comprising: 

a) generating alternate pluralities of data 
segments and associated segment models; 

and 

b) adjusting the alternate pluralities so that 
the resulting data segments satisfy the 
statistical constraints. 

These second and third aspects of the invention can realize 
significant advantages because they enable constraints to be 
placed on the statistical estimation errors that can be tolerated 
with respect to various aspects of the predictive models that are
constructed, while at the same time preventing premature 
termination of the segment refinement process caused by poor 
choices made during the construction of the segments. 

In a fourth aspect, the present invention discloses a 
program storage device readable by a machine, tangibly embodying 
a program of instructions executable by the machine to perform 
method steps for constructing segmentation-based models of 
insurance risks, the method steps comprising: 

(1) presenting a collection of training data comprising 
examples of historical policy and claims data; 

and 

(2) generating on the basis of the training data a 
plurality of segment models, that together comprise 
an overall model, wherein each segment model is
associated with a specific segment of the training 
data, the step of generating comprising performing 
optimization steps comprising: 

a) generating alternate training data segments 
and associated segment models; 

b) evaluating the generated segment models 
using numerical criteria derived from 
statistical models used by actuaries to 
model insurance risks;

and

c) selecting a final plurality of segment
models and associated segments from among
the alternates generated so as to optimize
aggregate numerical criteria for the
plurality.

This fourth aspect of the invention can realize significant 
advantages because the invention can be applied in conjunction
with any of the various statistical models employed by actuaries
in order to construct highly predictive risk models that take
into account the statistical properties of insurance data
associated with specific insurance products.

For example, the invention can be used in conjunction with a
joint Poisson/log-normal statistical model in order to construct
risk models for personal lines automobile insurance in
particular, and personal lines property and casualty insurance in
general.

The invention can also be advantageously used to model the 
profitability of insurance products by replacing the parameters 
that appear in the statistical models with parametric functions
of the premiums charged to policyholders. In the case of a joint
Poisson/log-normal model, the frequency and severity parameters
can be replaced with functions of the premiums charged. In such
illustrative expressions, the novel method can model the
relationships between actual insurance risk and an insurer's
existing price structure. In so doing, it can identify factors
that distinguish the most profitable policyholders from the least
profitable ones. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention is illustrated in the accompanying drawings, in
which: 

Figure 1 provides examples of theoretical histograms defined 
by the log-normal distribution for claim amounts and the 
logarithms of claim amounts; 

Figure 2 provides an illustration of the open-loop approach 
used by prior art methods for splitting larger segments into 
smaller segments in such a way that the smaller segments satisfy 
desired constraints on the statistical properties of the 
segments;

Figure 3 provides an illustration of the closed-loop method 
disclosed by the present invention for splitting larger segments 
into smaller segments in such a way that the smaller segments 
satisfy desired constraints on the statistical properties of the 
segments;

Figure 4 provides an illustration of how the introduction of 
each successive splitting factor simultaneously increases the 
number of risk groups and decreases the score of the risk model 
on the portion of the training data used for estimating model 
parameters and selecting splitting factors, how the true scores 
of each of the resulting risk models are estimated on the portion 
of the training data used for validation purposes, and how the 
optimum risk model is selected by choosing the model with the
smallest score on the validation data; 

Figure 5 provides an illustration of how policy-quarter data 
records must be subdivided for those quarters in which claims are
filed. 

DETAILED DESCRIPTION OF THE INVENTION 

As summarized above, the present invention enables
segmentation-based models to be constructed wherein the model
building process is constrained to only produce segments with
desirable statistical characteristics. In particular, for
insurance risk modeling purposes, the segments can be constrained
to be actuarially credible. A second benefit of the invention is
that it can be applied in conjunction with any of the various
statistical models employed by actuaries in order to construct
highly predictive risk models that take into account the
statistical properties of insurance data associated with specific
insurance products.

It is important to point out that the invention is widely 
applicable and its use is in no way restricted to insurance risk 
modeling. Insurance risk modeling is merely an illustrative 
example that clearly demonstrates the utility of incorporating
statistical constraints into the model building process. 
Insurance risk modeling will thus be used as the running example 
to motivate the technical aspects of the invention. 

Actuarial science is based on the construction and analysis
of statistical models that describe the process by which claims
are filed by policyholders (see, for example, Klugman et al.
above). Different types of insurance often require the use of
different statistical models. For any type of insurance, the
choice of statistical model is often dictated by the fundamental
nature of the claims process. For example, for property and
casualty insurance, the claims process consists of claims being
filed by policyholders at varying points in time and for varying 
amounts. In the normal course of events, wherein claims are not 
the result of natural disasters or other widespread catastrophes, 
10 loss events that result in claims (i.e., accidents, fire, theft, 
etc.) tend to be randomly distributed in time with no significant 
pattern to the occurrence of those events from the point of view 
of insurable risk. Policyholders can also file multiple claims 
W for the same type of loss over, the life of a policy. Claim 

-a— 
§ s 

15 jy filings such as these can be modeled as a Poisson random process 
(see, for example, Klugman et al. above) , which is the 
=p appropriate mathematical model for events that are randomly 
i3 distributed over time with the ability for events to reoccur 
;f: (i.e., renew) . 

The Poisson model is used as an illustrative example in the
illustrative embodiment of the invention that is presented below.
However, in the case of insurance risk modeling, the invention
can also be practiced using other statistical models of claim
events depending on the characteristics of the particular
insurance product being considered. The invention can likewise be
practiced in the general context of predictive modeling in
combination with virtually any kind of statistical model.

Although the Poisson model is generally quite suitable for
property and casualty insurance, it should be noted that Poisson
processes are not appropriate for modeling catastrophic claims
arising from widespread disasters, such as hurricanes,
earthquakes, floods, etc., because such catastrophes do not
produce uniform distributions of loss events over time. Instead,
natural disasters lead to clusters of claims being filed over
short periods of time, where the number of claims in each cluster
depends both on the number of policyholders in the region
affected by the disaster and on the severity of the disaster. An
appropriate statistical model would likewise take into account
the geographic distribution of policyholders relative to the
damage caused by disasters.

The Poisson model is also not appropriate for modeling other
forms of insurance, such as life insurance in which at most one
death benefit is ever claimed per policy. Life insurance is best
modeled as a survival process, not a renewal process.

Again, it is important to point out that the invention can
be practiced using other statistical models. The Poisson model is
not a requirement; it is merely a convenient example.

In addition to modeling the distribution of claims over
time, actuaries must also model the amounts of those claims. In
actuarial science, claim amounts for property and casualty
insurance are modeled as probability distributions. Two kinds of
distributions are usually considered: those for the amounts of
individual claims, and those for the aggregate amounts of groups
of claims. In principle, aggregate loss distributions can be
derived mathematically from the distributions of the individual
losses that make up the sum. However, only in a few special cases
can closed-form solutions be obtained for these mathematical
equations. In most cases, approximations must be employed.
Fortunately, actuaries typically consider large groups of claims
when analyzing aggregate loss. The central limit theorem can
therefore be invoked and aggregate losses can be reasonably
approximated by normal (i.e., Gaussian) distributions.

In one examination we made of historical automobile claims
data, claim amounts were found to have a highly skewed
distribution. Most claims were small in value relative to the
maximum amounts covered by the policies, but a significant
proportion of large claims were also present. When the claim
amounts were logarithmically transformed, the skewness virtually
disappeared and the resulting distribution was found to be highly
Gaussian in shape. These properties are the defining
characteristics of log-normal distributions, an example of which
is illustrated in Figure 1.
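
The effect of the logarithmic transformation can be illustrated
with a small simulation. The sketch below is not based on the
historical data referred to above: the parameters MU_LOG and
SIGMA_LOG, the sample size, and the helper name skewness are
assumptions chosen only for illustration. It draws synthetic
log-normal claim amounts and compares the skewness before and
after taking logarithms.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Hypothetical parameters of the log claim amounts (assumptions, not from the source).
MU_LOG, SIGMA_LOG = 7.0, 1.0

rng = random.Random(0)  # fixed seed so the run is repeatable
claims = [rng.lognormvariate(MU_LOG, SIGMA_LOG) for _ in range(20_000)]
log_claims = [math.log(c) for c in claims]

raw_skew = skewness(claims)      # strongly positive: long right tail
log_skew = skewness(log_claims)  # close to zero: roughly Gaussian

# Mean and standard deviation of the logarithms of the claim amounts.
mu_hat = statistics.fmean(log_claims)
sigma_hat = statistics.stdev(log_claims)

print(f"skewness of raw amounts: {raw_skew:.2f}")
print(f"skewness of log amounts: {log_skew:.2f}")
print(f"estimated log-mean {mu_hat:.3f}, log-sd {sigma_hat:.3f}")
```

The raw amounts show a pronounced right skew, while the
logarithms are close to symmetric, which is the behavior
described for the automobile claims data.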


The log-normal distribution is used as an illustrative
example in the illustrative embodiment of the invention that is
presented below. However, as with Poisson models, the invention
can also be practiced using other statistical distributions of
claim amounts depending on the characteristics of the particular
insurance product being considered. In particular, any of the
statistical distributions employed by actuaries (see, for
example, Klugman et al. above) can be used. The invention can
likewise be practiced in the general context of predictive
modeling in combination with virtually any kind of statistical
model.

Unfortunately, there are no closed-form solutions for the
aggregate loss distribution given that individual losses follow a
log-normal distribution. In particular, a sum of log-normal
random variables is not itself log-normal. An approximation must
therefore be made. In one embodiment of the invention that is
presented below, the central limit theorem is invoked and the
normal distribution is used to model aggregate losses. However,
aggregate loss distributions are not used to identify risk
groups; they are only used after the fact to estimate the
aggregate parameters of each risk group. Risk groups are
identified using numerical criteria that evaluate the predictive
accuracy of the resulting risk model on individual losses. The
above approximation for aggregate losses therefore has no effect
on the risk groups that are identified.

It should be noted, however, that the invention could 
alternatively be practiced using numerical criteria analogous to 
those employed in CHAID (see Kass above, and Biggs et al. above)
that would evaluate the statistical significance of observed 
differences in aggregate risk characteristics between alternative 
risk groups. For such criteria, the method of approximation would 
take on greater importance. 

Because different distributions are used to model individual 
versus aggregate losses, different statistical procedures are 
employed for estimating the parameters of these distributions. 
For the log-normal distributions used to model individual losses, 
the relevant statistical parameters are the means and standard 
deviations of the natural logarithms of the individual claim 
amounts. For the normal distributions used to model aggregate 
losses, the means and standard deviations of the (raw) claim 
amounts are the parameters that need to be estimated.

As previously discussed, a traditional method used by
actuaries to construct risk models involves segmenting the
overall population of policyholders into a collection of risk
groups based on a set of factors, such as age, gender, driving
distance to place of employment, etc. Actuaries typically employ
a combination of intuition, guesswork, and trial-and-error
hypothesis testing to identify suitable factors. The human effort
involved is often quite high and good risk models can take
several years to develop and refine.

The invention replaces manual exploration of potential risk
factors with automated search. When the invention is applied to
insurance risk modeling, risk groups are preferably identified in
a top-down fashion by a method that is similar in spirit to those
employed in prior art algorithms such as CHAID (see Kass above,
and Biggs et al. above), CART (see Breiman et al. above), C4.5
(see Quinlan, 1993, above), SPRINT (see Shafer et al. above), and
QUEST (see W.-Y. Loh and Y.-S. Shih above). Starting with an
overall population of policyholders, policyholders are first
divided into at least two risk groups by identifying the risk
factor that yields the greatest increase in predictive accuracy
given the risk groups that are produced, subject to the
constraint that the resulting risk groups must be actuarially
credible. Each resulting risk group is then further subdivided by
identifying additional factors in the same manner as before. The
process is continued until the resulting risk groups are declared
either to be homogeneous (i.e., further subdivisions do not
increase predictive accuracy) or too small to be further
subdivided from the point of view of actuarial credibility.

In the general context of predictive modeling, risk groups
correspond to population segments and actuarial credibility
corresponds to a statistical constraint on the population
segments. An important difference between the invention and prior
art methods is that the method of identifying splitting factors
for dividing larger population segments into smaller population
segments is preferably constrained so that the resulting segments
satisfy desired statistical constraints, where the constraints
can be arbitrarily complex. In particular, the constraints are
not restricted to the prior art technique of imposing global
thresholds on the number of training records of various types
that belong to each segment. In the case of risk modeling for
property and casualty insurance, actuarial credibility
constraints do correspond to thresholds on the number of claim
records that belong to each segment (i.e., risk group); however,
the thresholds are not global constants, but instead are
functions of the statistical properties of the claim amounts for
each segment. The thresholds can thus vary from one segment to
another.
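
To make the idea of a segment-dependent threshold concrete, the
sketch below uses the classical limited-fluctuation credibility
standard for pure premium with Poisson claim counts. This
formula is an assumption brought in for illustration only; it is
not taken from this document, which presents its own credibility
assessment later. The function names and the default values of p
and r are likewise hypothetical. Under this standard, the minimum
number of claims grows with the squared coefficient of variation
of the claim amounts, so two segments with different claim-amount
variability receive different thresholds.

```python
import math
from statistics import NormalDist


def min_claims_for_credibility(mean_amount, sd_amount, p=0.90, r=0.05):
    """Minimum claim count under the classical limited-fluctuation standard.

    Assumption (not the source's formula): the estimate should fall within
    a fraction r of its true value with probability p, which gives
    n >= (z / r)^2 * (1 + (sd / mean)^2) for Poisson claim counts.
    """
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)  # two-sided normal quantile
    cv_squared = (sd_amount / mean_amount) ** 2
    return math.ceil((z / r) ** 2 * (1.0 + cv_squared))


def is_credible(num_claims, mean_amount, sd_amount, p=0.90, r=0.05):
    """True when a segment has at least the required number of claim records."""
    return num_claims >= min_claims_for_credibility(mean_amount, sd_amount, p, r)


# Two hypothetical segments with the same mean claim but different spread.
low_variance = min_claims_for_credibility(mean_amount=2000.0, sd_amount=1000.0)
high_variance = min_claims_for_credibility(mean_amount=2000.0, sd_amount=4000.0)

print("threshold, low-variance segment: ", low_variance)
print("threshold, high-variance segment:", high_variance)
```

The threshold is computed from each segment's own claim-amount
statistics rather than fixed in advance, which is the property
the text attributes to actuarial credibility constraints.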

Another important difference between the invention and prior 
art methods is that, for the purpose of insurance risk modeling, 
splitting factors are preferably selected based on numerical 
optimization criteria derived from statistical models of 
insurance risk. For example, in the case of the illustrative 
embodiment of the invention presented below, a joint 
Poisson/log-normal model is used in order to simultaneously model 
frequency and severity, and thereby pure premium. Splitting
factors are selected in this example by minimizing a negative
log-likelihood criterion derived from the joint
Poisson/log-normal model. Minimizing this criterion maximizes the
likelihood of the data given the joint Poisson/log-normal model,
and it thereby maximizes the predictive accuracy of the resulting
risk model. Tailoring the choice of statistical model to the 
specific characteristics of the insurance product being 
considered can yield more accurate risk models than could be 
obtained using, for example, a conventional least-squares (i.e., 
Gaussian) model. 

If the invention were practiced using other statistical 
models, the same methodology for identifying risk factors would 
preferably be employed, except that the optimization criteria 
would preferably change according to the statistical models that 
are used. For example, in the case of profitability modeling, the 
statistical models could quantify insurance risk as a function of 
the premium charged. The optimization criteria derived from the 
resulting statistical models would then cause risk factors to be 
identified that yield the greatest increase in predictive 
accuracy with respect to estimated loss ratio instead of 
estimated risk. Suitable optimization criteria for other forms of 
insurance could likewise be derived from statistical models 
appropriate for those products. 

Any prior art method for identifying splitting factors can 
be modified to meet the requirements of the invention by suitably 
constraining the methods to always produce splits that satisfy 
the desired statistical constraints on the resulting segments. In 
addition, for the purpose of insurance risk modeling, the 
numerical criteria employed by those methods would preferably be 
replaced with numerical criteria derived from statistical models 
of insurance risk. For example, the methods employed in CHAID
(see Kass above, and Biggs et al. above), CART (see Breiman et
al. above), C4.5 (see Quinlan, 1993, above), SPRINT (see Shafer
et al. above), and QUEST (see W.-Y. Loh and Y.-S. Shih above) can
all be modified for use with the invention.

Of these, we prefer a modified version of the bottom-up 
merging technique used in CHAID. This preferred, modified version 
always attempts to produce two-way splits in the case of 
non-missing values, and it avoids producing segments that fail to 
satisfy the desired statistical constraints. The preferred method 
for dividing a larger segment into two or more smaller segments 
proceeds as follows: 

1) For each explanatory data field (i.e., each data field whose
values are allowed to be used to distinguish one
population segment from another), divide the larger
segment into smaller, mutually-exclusive segments based
on the possible values of that explanatory data field in
the same manner as done in CHAID.

Thus, in the case of a categorical data field, each of 
the smaller segments corresponds to one of the category 
values admitted under the definition of the larger 
segment. If the data field is not mentioned in the 
definition of the larger segment, then a smaller segment 
is constructed for each possible category value for that 
field. If the definition of the larger segment restricts 
the value of the data field to a subset of category 
values, then smaller segments are constructed only for 
category values in that subset. In both cases, it is
possible that some category values correspond to missing
values for the data field.

In the case of a numerical data field, the possible
values of the data field are discretized into ordinal
classes as described by Biggs et al. (see Biggs et al.
above) and segments are constructed for each of these
ordinal classes. Segments must also be constructed for
additional "floating" categories (see Kass above) that
correspond to missing values for the data field.

2) For each explanatory data field, set aside those segments 
constructed in step 1 that admit missing values for the 
explanatory field and perform the following merge steps 
on the remaining segments for the explanatory field: 



a) For nominal explanatory fields, merge together
all remaining segments for which the record count
for at least one species of training records
belonging to the segment lies below a given
threshold for that species.

b) For ordinal explanatory fields, if all remaining
segments have at least one training record
species count that lies below the corresponding
threshold referred to in step 2a, then merge all
remaining segments together. Otherwise,
repeatedly select and merge pairs of remaining
segments that satisfy the following conditions
until the conditions can no longer be satisfied
or until a single segment is obtained:

i) The values of the explanatory field
that are admitted by the two segments
to be merged are adjacent with respect
to the ordering of the values for that
ordinal explanatory field.

ii) At least one training record species
count for one of the segments to be
merged lies below the corresponding
threshold referred to in step 2a, while
all training record species counts for
the other segment in the pair lie above
the corresponding thresholds.

3) For each explanatory data field, set aside those segments
that admit missing values for the explanatory field. If
two or more segments remain, then repeatedly select and
merge pairs of the remaining segments for the explanatory
field so as to optimize the desired numerical criteria
for selecting splitting factors subject to the following
conditions:

a) If at least one of the remaining segments does
not satisfy the desired statistical constraints
for segments, then at least one of the segments
in the pair being merged must likewise not
satisfy the statistical constraints.

b) In the case of ordinal data fields, the values of
the explanatory field that are admitted by the
two segments being merged must be adjacent with
respect to the ordering of the values for that
ordinal explanatory field.

Continue the merging process until only two segments
remain (i.e., not including the segments that were set
aside that admit missing values for the field). If one of
these two remaining segments does not satisfy the desired
statistical constraints for segments, then merge the two
remaining segments into a single segment.

4) For each explanatory data field, set aside those segments
that admit missing values for the explanatory field. If a
single segment remains, then eliminate the explanatory
data field from further consideration provided at least
one of the following conditions holds:

a) the single remaining segment does not satisfy the
desired statistical constraints for segments;

b) the single remaining segment does indeed satisfy
the desired statistical constraints for segments,
but no segments were set aside that admit missing
values for the explanatory field.

5) If all explanatory data fields were eliminated in step 4,
then the larger segment cannot be divided into smaller
segments that satisfy the desired statistical constraints
for segments. Subdivision cannot be performed; therefore,
stop any further attempt to divide the larger segment.

6) Otherwise, for each explanatory data field that was not
eliminated from consideration in step 4, and for each
smaller segment that was constructed for the explanatory
field that was set aside in step 3 because it admits
missing values, if the smaller segment does not satisfy
the desired statistical constraints for segments, then
assign to the smaller segment the segment model of the
larger segment from which the smaller segment was
originally obtained in step 1.

7) Select the explanatory data field that was not eliminated 
in step 4 for which the segments constructed for that 
data field optimize the desired numerical criteria for 
selecting splitting factors. 

8) Divide the larger segment into the smaller segments 
constructed for the explanatory data field selected in 
step 7. 
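
The constrained pairwise merging of step 3 can be sketched for a
single ordinal field as follows. This is a minimal illustration
under stated assumptions, not the full procedure: each segment is
represented by a hypothetical list of target values, a simple
record-count test (has_enough_records) stands in for the desired
statistical constraints, and a within-segment sum of squared
errors (total_sse) stands in for the numerical criterion, whereas
the illustrative embodiment uses a negative log-likelihood.
Missing-value segments are assumed to have been set aside already.

```python
def sse(values):
    """Sum of squared deviations from the segment mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def total_sse(segments):
    """Stand-in numerical criterion (smaller is better)."""
    return sum(sse(s) for s in segments)


def merge_ordinal_segments(segments, satisfies, criterion):
    """Greedy constrained merge of adjacent segments, in the spirit of step 3."""
    segments = [list(s) for s in segments]
    while len(segments) > 2:
        violators_exist = any(not satisfies(s) for s in segments)
        best_index, best_score = None, None
        # Condition 3b: only adjacent segments may be merged.
        for i in range(len(segments) - 1):
            left, right = segments[i], segments[i + 1]
            # Condition 3a: while violators remain, each merge must involve one.
            if violators_exist and satisfies(left) and satisfies(right):
                continue
            candidate = segments[:i] + [left + right] + segments[i + 2:]
            score = criterion(candidate)
            if best_score is None or score < best_score:
                best_index, best_score = i, score
        i = best_index
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
    # If one of the two survivors still violates the constraint, merge them.
    if len(segments) == 2 and not all(satisfies(s) for s in segments):
        segments = [segments[0] + segments[1]]
    return segments


def has_enough_records(segment, min_count=2):
    """Hypothetical stand-in for the statistical constraint on a segment."""
    return len(segment) >= min_count


# Five ordinal classes; the second and fourth are too small on their own.
classes = [[1.0, 1.2], [1.1], [5.0, 5.2, 4.8], [5.1], [9.0, 9.5, 9.2]]
result = merge_ordinal_segments(classes, has_enough_records, total_sse)
print([len(s) for s in result])
```

Because the constraint is consulted while pairs are being chosen,
the undersized classes are absorbed first, and the final two-way
split is only kept if both sides satisfy the constraint.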

Step 1 is the same initial step performed by CHAID. Step 2 
has no counterpart in CHAID, nor in any other prior art 
tree-based modeling technique. Step 2 is introduced in order to 
stabilize the merging process performed in step 3. The premerging 
performed in step 2 effectively eliminates spurious segments that 
are too small to meaningfully compare with other segments. The 
thresholds in step 2 should be set as small as possible, but 
large enough to yield meaningful comparisons among pairs of 
segments in step 3. 

For example, in the case of personal lines property and
casualty insurance, the only training record species that matters
in the determination of actuarial credibility is the set of claim
records. Consequently, only a single threshold needs to be
considered in step 2 for this single species. Generally speaking,
standard deviations of automobile insurance claim amounts tend to
be about the same order of magnitude as mean claim amounts. In
light of this very large variance in claim amounts, a minimum
threshold of 6 claim records was found to be necessary to produce
acceptable results. In general, actuarial credibility must be
assessed relative to the specific type of insurance risk being
modeled (see, for example, Klugman et al. above). A concrete
example of how to assess actuarial credibility is presented below
in the case of personal lines property and casualty insurance.
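
The premerging of step 2a for a nominal field can be sketched as
follows, using the single claim-record species and the threshold
of 6 mentioned above. The data layout (a dictionary of per-species
counts for each category value), the function name
premerge_nominal, and the example categories and counts are
assumptions made for illustration.

```python
def premerge_nominal(segment_counts, thresholds):
    """Merge all undersized segments of a nominal field into one segment.

    segment_counts maps each category value to a dict of record counts per
    species; thresholds maps each species to its minimum count. Segments are
    keyed by the tuple of category values they admit.
    """
    small, kept = [], {}
    for category, counts in segment_counts.items():
        below = any(counts.get(species, 0) < minimum
                    for species, minimum in thresholds.items())
        if below:
            small.append(category)
        else:
            kept[(category,)] = dict(counts)
    if small:
        # One merged segment admitting every undersized category value.
        kept[tuple(small)] = {
            species: sum(segment_counts[c].get(species, 0) for c in small)
            for species in thresholds
        }
    return kept


# Hypothetical claim-record counts per category of a nominal field.
counts_by_category = {
    "sedan": {"claim": 40},
    "coupe": {"claim": 3},
    "van": {"claim": 12},
    "convertible": {"claim": 2},
}
premerged = premerge_nominal(counts_by_category, thresholds={"claim": 6})
for values, counts in premerged.items():
    print(values, counts)
```

In this example the merged segment still has fewer than 6 claim
records. Step 2 only removes segments that are too small to
compare meaningfully; whether the result satisfies the statistical
constraints is decided in the later steps.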

Steps 3 through 8 form the counterpart to the bottom-up
merging process employed by CHAID. In our method, however, the
process is constrained to always produce segments that satisfy
desired statistical constraints on the segments (e.g., actuarial
credibility constraints in the case of insurance risk modeling).

An important aspect of the present invention that
distinguishes it from prior art methods is the fact that
statistical constraints are applied as an integral part of the
method for splitting larger segments into smaller segments.
Statistical constraints are applied as splits are being
constructed in order to guide the construction process. In sharp
contrast, prior art methods, such as CHAID (see Kass above, and
Biggs et al. above), CART (see Breiman et al. above), C4.5 (see
Quinlan, 1993, above), SPRINT (see Shafer et al. above), and
QUEST (see W.-Y. Loh and Y.-S. Shih above), apply corresponding
statistical constraints only after splits have been constructed.
This methodology is illustrated in Figure 2. A deficiency of this
prior art method is that splits may be constructed that violate
the statistical constraints, causing them to be eliminated from
further consideration, even though it may have been possible to
construct alternate splits that actually do satisfy the
statistical constraints.

By using the statistical constraints to guide the 
construction of splits, our method is able to produce splits that 
satisfy the statistical constraints whenever it is feasible to do 
so. Our method thereby avoids premature termination of the 
segment refinement process caused by poor choices made during the 
construction of splits. 

The above distinction between the present invention and 
prior art is analogous to the distinction between closed-loop and 
open-loop control systems (see, for example, J. J. Distefano, 
A. R. Stubberud, and I. J. Williams above). The prior art 
approach is open-loop in the sense that the statistical 
constraints, which play the role of error signals, are evaluated 
only after the splits have been constructed (see Figure 2). Poor
choices made during the construction of the splits can result in 
segments that violate the statistical constraints even though it 
may have been possible to construct splits that satisfy the 
constraints. In the case of the present invention, on the other 
hand, the statistical constraints are repeatedly evaluated while 
constructing splits to determine whether or not they hold, and 
the results of the evaluations are used to regulate the 
construction process. This closed-loop methodology, which is 
illustrated in Figure 3, ensures that the statistical constraints 
will be satisfied whenever it is feasible to do so.

Another distinguishing feature of the present invention is
that complete flexibility is allowed in steps 3 through 8 with
regard to the numerical criteria used to select splitting
factors. The criteria need not be restricted to the chi-squared
statistical significance criteria employed by CHAID. For example,
in the illustrative example presented below, maximum likelihood
criteria are used that are derived from a joint
Poisson/log-normal statistical model of insurance risk.

The bottom-up merging process presented here also differs 
from the one used by CHAID in the way that missing values are 
treated. Steps 3 through 8 preferably construct separate segments 
for cases in which the values of explanatory data fields are 
missing. The CHAID approach of merging missing-value segments 
with other segments constructed on the basis of known values of 
the corresponding data field would bias the parameter estimates 
of the segment model for the merged segment. For insurance risk 
modeling purposes, actuaries generally prefer unbiased estimates 
of the risk parameters of each risk group (i.e., segment). For 
other applications, steps 3 through 8 could be modified to allow
missing-value segments to be merged with other segments either by
not setting missing-value segments aside in steps 3 through 8, or
by performing an additional step between steps 3 and 4 to
explicitly merge missing-value segments with other segments and
then not setting missing-value segments aside in steps 4
through 8.

Another aspect in which the bottom-up merging process
presented here differs from the one used by CHAID is that steps 3
through 8 always produce either one-way or two-way splits for
non-missing values, whereas CHAID can produce multiway splits.
The bottom-up merging process defined by steps 3 through 8 is
designed in such a way that any multiway split that can be
obtained by prematurely terminating step 3 can also be obtained
by performing steps 3 through 8 several times on the same data
field. However, after performing a two-way split on one data
field, it is often the case that splitting on other data fields
then produces more accurate predictive models than repeatedly
splitting on the same data field (i.e., performing a multiway
split). It is for this reason that multiway splits are not
preferred in our invention.

To construct an overall predictive model, the above method 
for dividing larger segments into two or more smaller segments is 
first applied to the overall population of training records being 
considered. The splitting method is then repeatedly applied to 
each resulting segment until further applications of the 
splitting method are either no longer possible or no longer 
beneficial. The statistical constraints used by the splitting 
method provide one set of criteria that are used to decide when 
to stop splitting. However, another criterion must be applied to 
avoid overfitting. 

Overfitting occurs when the best model relative to a set of
training data tends to perform significantly worse when applied
to new data. In the illustrative embodiment of the invention that
is presented below, a negative log-likelihood criterion is used
that is derived from a joint Poisson/log-normal model of the
claims process. This
negative log-likelihood criterion is essentially a score that
measures the predictive accuracy of the model. Risk groups are
identified by searching for splitting factors that minimize this
score with respect to the training data. However, the score can
be made arbitrarily small simply by introducing enough splitting
factors. As more splitting factors are introduced, a point of
overfitting is reached where the value of the score as estimated
on the training data no longer reflects the value that would be
obtained on new data. Adding splitting factors beyond this point
would simply make the model worse.

Overfitting mathematically corresponds to a situation in 
which the score as estimated on the training data substantially 
underestimates the expected value of the score that would be 
obtained if the true statistical properties of the data were 
already known. Results from statistical learning theory (see, for 
example, V. N. Vapnik, The Nature of Statistical Learning Theory,
New York: Springer-Verlag, 1995, and V. N. Vapnik, Statistical 
Learning Theory, New York: John Wiley & Sons, 1998) demonstrate 
that, although there is always some probability that 
underestimation will occur for any given model, both the 
probability and the degree of underestimation are increased by 
the fact that we explicitly search for the model that minimizes 
the estimated model score. This search biases the difference 
between the estimated model score and the expected value of that 
score toward the maximum difference among competing models. 

Our preferred method for avoiding overfitting involves
randomly dividing the available training data into two subsets:
one that is used for actual training (i.e., for selecting
splitting factors and estimating the parameters of segment
models); the other that is used for validation purposes to
estimate the true performance of the model. As splitting factors
are introduced by minimizing the score on the first of these
subsets of training data, a sequence of predictive models is
constructed in which each successive model contains more segments
than its predecessors. The true score of each predictive model is
then estimated by evaluating the negative log-likelihood
criterion on the validation data for each segment in the
predictive model and summing the results. The predictive model
that minimizes this estimate of the true score is selected as the
most accurate predictive model given the available training data.
The overall process is illustrated in Figure 4. The reduced error
pruning method described by Quinlan (see J. R. Quinlan,
"Simplifying decision trees," International Journal of
Man-Machine Studies, Vol. 27, pp. 221-234, 1987) provides an
efficient approach for implementing the preferred method for
avoiding overfitting.
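
The selection procedure described above can be sketched as
follows. This is a simplified illustration, not an implementation
of reduced error pruning: the function names, the 30 percent
validation fraction, and the per-segment validation scores are
hypothetical, and each model in the sequence is represented only
by the list of negative log-likelihood values obtained for its
segments on the validation subset.

```python
import random


def split_training_validation(records, validation_fraction, seed):
    """Randomly divide records into a training subset and a validation subset."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * (1.0 - validation_fraction)))
    return shuffled[:cut], shuffled[cut:]


def validation_score(segment_scores):
    """Estimated true score of a model: sum of its per-segment validation scores."""
    return sum(segment_scores)


def select_model(models):
    """Index of the model with the smallest validation score.

    Ties resolve to the earliest model, i.e. the one with the fewest segments.
    """
    return min(range(len(models)), key=lambda i: validation_score(models[i]))


training, validation = split_training_validation(range(100), 0.3, seed=1)

# Hypothetical sequence of models with 1, 2, 3 and 4 segments. Each inner
# list holds the validation negative log-likelihood of one segment.
models = [
    [1250.0],
    [600.0, 610.0],
    [300.0, 290.0, 605.0],
    [300.0, 295.0, 310.0, 312.0],
]
best = select_model(models)
print("validation scores:", [validation_score(m) for m in models])
print("selected model has", len(models[best]), "segments")
```

In this made-up sequence the validation score falls until the
third model and then rises again, so the three-segment model is
selected even though a four-segment model is available.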

It is important to note that the phenomenon illustrated in 
Figure 4 occurs even though statistical constraints on the 
segments are incorporated into the method for selecting splitting 
factors. Such statistical constraints (e.g., actuarial 
credibility) do not prevent overfitting in and of themselves. A 
separate mechanism is needed to avoid overfitting, such as that 
described above. 

To reduce the above method for constructing predictive
models to a particularized expression, all that is required is to
construct appropriate numerical criteria for selecting splitting
factors, and to incorporate appropriate statistical constraints
for the segments. The construction of these components will now
be illustrated in the case of a joint Poisson/log-normal model
suitable for personal lines property and casualty insurance. The
same joint Poisson/log-normal model will then be used to
illustrate the general method for constructing profitability
models from statistical models of insurance risk. Finally, an
example will be presented to illustrate how statistical
constraints can be developed for use with conventional
statistical models; in particular, for weighted least-squares
models of the kind found in prior art regression tree methods
such as CART (see L. Breiman et al. above) and SPSS's
implementation of CHAID (see, for example, http://www.SPSS.com).

The optimization criterion that will be constructed for
identifying splitting factors is based on the principles of
maximum likelihood estimation. Specifically, the negative
log-likelihood of each data record is calculated assuming a joint
Poisson/log-normal statistical model, and these negative
log-likelihoods are then summed to yield the numerical criterion
that is to be optimized. Minimizing this negative log-likelihood
criterion causes splitting factors to be selected that maximize
the likelihood of the observed data given the joint
Poisson/log-normal models of each of the resulting risk groups.
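
As a rough illustration of such a summed criterion, the sketch
below uses the textbook forms of the Poisson and log-normal
negative log-likelihoods. It is an assumption-laden stand-in, not
the equations of this document, whose own derivation may differ
in form and in constant terms. The record layout (an exposure
time paired with a list of claim amounts) and all function and
parameter names are hypothetical.

```python
import math


def poisson_nll(num_claims, exposure, rate):
    """Negative log-probability of num_claims claims in the given exposure time."""
    expected = rate * exposure
    return expected - num_claims * math.log(expected) + math.lgamma(num_claims + 1)


def lognormal_nll(amount, mu_log, sigma_log):
    """Negative log-density of one claim amount under a log-normal distribution."""
    z = (math.log(amount) - mu_log) / sigma_log
    return (math.log(amount) + math.log(sigma_log)
            + 0.5 * math.log(2.0 * math.pi) + 0.5 * z * z)


def segment_nll(records, rate, mu_log, sigma_log):
    """Sum of per-record negative log-likelihoods for one risk group."""
    total = 0.0
    for exposure, amounts in records:
        total += poisson_nll(len(amounts), exposure, rate)
        total += sum(lognormal_nll(a, mu_log, sigma_log) for a in amounts)
    return total


# Four hypothetical quarterly records (exposure in years, claim amounts).
records = [(0.25, []), (0.25, []), (0.25, [1200.0]), (0.25, [])]
mu_log, sigma_log = math.log(1200.0), 1.0

# One claim over one year of total exposure: the maximum likelihood
# estimate of the claim rate is 1.0 claims per year.
nll_at_mle = segment_nll(records, 1.0, mu_log, sigma_log)
nll_low = segment_nll(records, 0.5, mu_log, sigma_log)
nll_high = segment_nll(records, 2.0, mu_log, sigma_log)
print(nll_at_mle, nll_low, nll_high)
```

The score is smallest at the claim rate that matches the observed
claims per unit of exposure, which is the sense in which
minimizing the criterion maximizes the likelihood of the data.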

To derive equations for the negative log-likelihood
criterion, it is necessary to examine the representation of
claims data in more detail. Historical data for each policy must
be divided into distinct time intervals for the purpose of
predictive modeling, with one data record constructed per policy
per time interval. Time-varying risk characteristics are assumed
to remain constant within each time interval; that is, for all
intents and purposes their values are assumed to change only from
one time interval to the next. The choice of time scale is
dictated by the extent to which this assumption is appropriate
given the type of insurance being considered and the business
practices of the insurer. For convenience, quarterly intervals
will be assumed to help make the discussion below more concrete,
but it should be noted that monthly or yearly intervals are also
possible.

Assuming that data are divided into quarterly intervals, 
most data records will span entire quarters, but some will not. 
In particular, data records that span less than a full quarter 
must be created for policies that were initiated or terminated 
mid-quarter, or that experienced mid-quarter changes in their 
risk characteristics. In the case of the latter, policy-quarters 
must be divided into shorter time intervals so that separate data 
records are created for each change in the risk characteristics 
of a policy. This subdivision must be performed in order to 
maintain the assumption that risk characteristics remain constant 
within the time intervals represented by each data record. 

One particular case in which subdivision must occur is when 
claims are filed under a policy in a given quarter. The filing of 
a claim can itself be an indicator of future risk (i.e., the more 
claims one files, the more likely one is to file future claims).
Claim events must therefore be treated as risk characteristics 
that can change mid-quarter. However, claim events are special in 
that there is a second reason for subdividing policy-quarters 
when claims are filed. In order to reduce storage requirements, 
data records typically contain only one severity field per 
coverage. For example, in the case of automobile insurance, there 
would be one claim amount listed for property damage, one for 
bodily injury, one for collision, and so forth. Consequently, if 
one were to use a single record to represent information about 
two or more claims, one would only be able to record the total 
claims for each coverage, not the individual claim amounts.
However, we need to know the individual claim amounts in order to
correctly determine which claims involved which combinations of 
coverages. Both pieces of information — the exact combination of 
coverages for each claim and the individual claim amounts for 
each coverage — are needed to correctly estimate the frequency and 
severity parameters of the statistical models. Subdividing 
policy-quarters when claims are filed ensures that all relevant 
information about each claim is preserved. 

The above method of decomposing policy and claims data into 
a collection of time intervals requires that the equations for 
calculating negative log-likelihoods be decomposed in a similar 
fashion. To motivate the decomposition, let us first consider the 
case in which there is a single homogeneous risk group whose risk 
characteristics remain constant over time. Policy-quarters are 
then subdivided only when claims occur. 

Figure 5 depicts the database records that are constructed
in this situation. $Q_0$, $Q_1$, $Q_2$, etc., represent the ending days of
a sequence of quarters. $T_0$ represents the day on which a
particular policy came into force, while $T_1$ represents the day
the first claim was filed under that policy. Though not
illustrated, $T_2$, $T_3$, $T_4$, etc., would represent the days on which
subsequent claims were filed. For modeling purposes, the policy
claims data is divided into a sequence of database records with
earned exposures $t_1$, $t_2$, $t_3$, etc.

As Figure 5 illustrates, new policies typically come into
force in the middle of quarters. Thus, the earned exposure for
the first quarter of a policy's existence (e.g., $t_1$) is generally
less than a full quarter. The earned exposures for subsequent
quarters, on the other hand, correspond to full quarters (e.g.,
$t_2$, $t_3$, and $t_4$) until such time that a claim is filed or the
policy is terminated. When a claim is filed, the data for that
quarter is divided into two or more records. The earned exposure
for the first database record (e.g., $t_5$) indicates the point in
the quarter at which the claim was filed. The earned exposure for
the second record (e.g., $t_6$) indicates the time remaining in the
quarter, assuming only one claim is filed in the quarter as
illustrated in the diagram. If two or more claims are filed in
the quarter, then three or more database records are constructed:
one record for each claim and one record for the remainder of the
quarter (assuming that the policy has not been terminated).
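
As a rough illustration of the record construction just described,
the following Python sketch splits one policy's history at quarter
ends and at claim filings. The names `Record` and `split_policy`, the
use of plain day numbers for dates, and the assumption that no two
claims fall on the same day are choices made for this example only;
they are not taken from the method itself.

```python
from dataclasses import dataclass


@dataclass
class Record:
    start: float     # day on which the record begins
    exposure: float  # earned exposure of the record, in days
    is_claim: bool   # True if the record ends with a claim filing


def split_policy(inception, termination, quarter_ends, claim_days):
    """Split one policy's history into records at quarter ends and claims.

    Hypothetical helper: assumes risk characteristics change only when a
    claim is filed and that claim days are distinct.
    """
    claims = sorted(d for d in claim_days if inception < d <= termination)
    # Every boundary closes a record: quarter ends, claim days, termination.
    boundaries = sorted(
        {q for q in quarter_ends if inception < q < termination}
        | set(claims)
        | {termination}
    )
    records = []
    start = inception
    for end in boundaries:
        records.append(Record(start, end - start, end in claims))
        start = end
    return records


# Policy in force from day 30 to day 360, one claim filed on day 200.
records = split_policy(30, 360, [90, 180, 270, 360], [200])
```

In this example the quarter containing the claim is split into a
claim record that ends on the filing day and a nonclaim record that
covers the remainder of the quarter.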

For Poisson random processes, the time between claim events
follows an exponential distribution. Moreover, no matter at what
point one starts observing a Poisson process, the time to the
next claim event has the same exponential distribution as the
time between claim events. For the example shown in Figure 5, the
probability density function $f$ for the time $(T_1 - T_0)$ between the
policy inception and the first claim being filed is given by

$$f(T_1 - T_0) = \lambda e^{-\lambda (T_1 - T_0)} = \lambda e^{-\lambda (t_1 + t_2 + t_3 + t_4 + t_5)} ,$$

where $\lambda$ is the claim frequency of the risk group. This
probability density function can be decomposed to reflect the
earned exposures of each database record by using the chain rule
of probability theory in the following manner:

$$\begin{aligned}
f(T_1 - T_0) &= P\{T_1 - T_0 > t_1\} \cdot P\{T_1 - T_0 > t_1 + t_2 \mid T_1 - T_0 > t_1\} \\
&\quad \cdot P\{T_1 - T_0 > t_1 + t_2 + t_3 \mid T_1 - T_0 > t_1 + t_2\} \\
&\quad \cdot P\{T_1 - T_0 > t_1 + t_2 + t_3 + t_4 \mid T_1 - T_0 > t_1 + t_2 + t_3\} \\
&\quad \cdot f(T_1 - T_0 = t_1 + t_2 + t_3 + t_4 + t_5 \mid T_1 - T_0 > t_1 + t_2 + t_3 + t_4) \\
&= e^{-\lambda t_1} \cdot e^{-\lambda t_2} \cdot e^{-\lambda t_3} \cdot e^{-\lambda t_4} \cdot \lambda e^{-\lambda t_5} \\
&= \lambda e^{-\lambda (t_1 + t_2 + t_3 + t_4 + t_5)} \\
&= \lambda e^{-\lambda (T_1 - T_0)} .
\end{aligned}$$

Thus, according to this decomposition, each nonclaim record
(i.e., one that does not describe a claim event) can be assigned
a probability of $e^{-\lambda t}$, where $t$ is the earned exposure of that
record. Each claim record, on the other hand, can be assigned a
probability density of $\lambda e^{-\lambda t}$. The probability density for the time
of the first claim filing is then obtained by multiplying these
assigned probabilities/densities.
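
The decomposition can be checked numerically. The short Python
sketch below multiplies the per-record probabilities and densities
and compares the product with the exponential density evaluated
directly. The function name `record_likelihood`, the rate, and the
exposure values are illustrative assumptions.

```python
import math


def record_likelihood(rate, exposure, is_claim):
    """Probability (nonclaim) or density (claim) assigned to one record."""
    survival = math.exp(-rate * exposure)
    return rate * survival if is_claim else survival


rate = 0.02                                  # assumed claims per day
exposures = [60.0, 90.0, 90.0, 90.0, 20.0]   # t1 .. t5 in days
flags = [False, False, False, False, True]   # only the last record is a claim

# Product of the per-record terms.
product = math.prod(
    record_likelihood(rate, t, c) for t, c in zip(exposures, flags)
)

# Exponential density of the time from inception to the first claim.
direct = rate * math.exp(-rate * sum(exposures))
```

The two quantities agree up to floating-point rounding, which is the
content of the decomposition above.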

The above decomposition likewise holds for two or more
claims. Suppose that a total of $k+l$ claims have been observed,
where $k$ is the number of fully settled claims and $l$ is the number
of claims that are still open. The probability density for the
time $T$ between $k+l$ claim filings is given by

$$f(T \mid k+l) = \lambda^{k+l} e^{-\lambda T} . \qquad (1)$$

The same probability density is obtained by multiplying the
probabilities/densities assigned to the individual database
records that are involved. As a result of the multiplication, $T$
will equal the total earned exposure as calculated from the
database records, and $k+l$ will equal the total number of claim
records. The important thing to note about this density function
is that the exact times at which the claims are filed do not
matter in the calculation; what matters is the number of
claims and the total earned exposure.

Although the decomposition developed above was motivated by
assuming that only one risk group exists, the decomposition also
holds for multiple risk groups. In the latter case, each database
record is assigned to a risk group according to the risk
characteristics that define each group. The probability density
function for the total earned exposure of a risk group is then
calculated by multiplying the probabilities/densities assigned to
the database records within the risk group. The resulting
probability density function has the same form as that given in
Equation (1).



The maximum likelihood estimate of the frequency parameter $\lambda$
is obtained by maximizing the value of Equation (1). The
resulting formula for estimating $\lambda$ is the same one that is
typically used by actuaries:

$$\hat{\lambda} = \frac{k+l}{T} = \frac{\text{Total Number of Claims}}{\text{Total Earned Exposure}} . \qquad (2)$$
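
A minimal Python sketch of the frequency estimate follows. It
evaluates the logarithm of Equation (1) and confirms that the ratio
of claims to exposure scores higher than nearby rates. The function
names and the sample numbers are assumptions made for illustration.

```python
import math


def log_likelihood(rate, n_claims, exposure):
    """Logarithm of Equation (1): (k + l) * log(rate) - rate * T."""
    return n_claims * math.log(rate) - rate * exposure


def frequency_mle(n_claims, exposure):
    """Equation (2): total number of claims over total earned exposure."""
    return n_claims / exposure


# Three claim records observed over twelve quarters of earned exposure.
rate_hat = frequency_mle(3, 12.0)
```

Setting the derivative of the log-likelihood to zero gives
$(k+l)/\lambda - T = 0$, which is the same ratio.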

The probability functions that govern claim severity are
somewhat easier to derive. In the case of nonclaim records, the
severity $s$ is always zero by definition. Thus,

$$P\{s = 0 \mid \text{Nonclaim Record}\} = 1 , \qquad P\{s \ne 0 \mid \text{Nonclaim Record}\} = 0 . \qquad (3)$$

In the case of claim records, the severity is assumed to follow a
log-normal distribution, which is defined by the following
probability density function:

$$f(s) = \frac{1}{s \sqrt{2\pi}\, \sigma_{\log}} \exp\!\left( -\frac{(\log(s) - \mu_{\log})^2}{2 \sigma_{\log}^2} \right) , \qquad (4)$$

where $\log(s)$ is the natural logarithm of $s$ (i.e., base $e$), $\mu_{\log}$ is
the mean of $\log(s)$, and $\sigma_{\log}^2$ is the variance of $\log(s)$. The mean $\mu$ and
variance $\sigma^2$ of the severity $s$ are related to the mean and
variance of the log severity by the following equations:

$$\mu = e^{\mu_{\log} + \frac{1}{2} \sigma_{\log}^2} , \qquad \sigma^2 = \mu^2 \left( e^{\sigma_{\log}^2} - 1 \right) .$$
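
The standard log-normal relations between the log-scale parameters
and the mean and variance of the severity can be written as a small
Python helper. The function name `lognormal_moments` is an assumption
of this sketch.

```python
import math


def lognormal_moments(mu_log, var_log):
    """Mean and variance of a log-normal variable.

    mu_log and var_log are the mean and variance of log(s).
    """
    mean = math.exp(mu_log + 0.5 * var_log)
    variance = mean ** 2 * (math.exp(var_log) - 1.0)
    return mean, variance


mean, variance = lognormal_moments(1.0, 0.5)
```

Note that the ratio of variance to squared mean depends only on the
variance of the log severity.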



Equations (3) and (4) define a probability decomposition for
severity that is analogous to the one developed above for
frequency. In this case, each nonclaim record is assigned a
probability of one as per Equation (3), while each fully settled
claim record is assigned the probability density defined by
Equation (4). The product of these probabilities/densities yields
the joint probability density function for the severities of the
settled claims that were filed (open claims are treated
separately as discussed below):

$$f(s_1, \ldots, s_k) = \frac{ \exp\!\left( -\dfrac{\sum_{i=1}^{k} (\log(s_i) - \mu_{\log})^2}{2 \sigma_{\log}^2} \right) }{ \left( \sqrt{2\pi}\, \sigma_{\log} \right)^k \prod_{i=1}^{k} s_i } ,$$

where $k$ is the number of settled claims and $s_1, \ldots, s_k$ are the claim
amounts. Note that this method of calculation assumes that the
severities are statistically independent and identically
distributed random variables, which is an appropriate assumption
for homogeneous risk groups.



Given the joint probability density function above, the mean
and variance of the log severity are estimated using the
equations

$$\hat{\mu}_{\log} = \frac{1}{k} \sum_{i=1}^{k} \log(s_i) \qquad (5)$$

and

$$\hat{\sigma}_{\log}^2 = \frac{1}{k-1} \sum_{i=1}^{k} \left( \log(s_i) - \hat{\mu}_{\log} \right)^2 , \qquad (6)$$

respectively. Equations (5) and (6) are used during training to
estimate the parameters of the loss distribution for individual
claims defined by Equation (4). These estimators presume that the
individual loss distributions are log-normal. Aggregate losses,
however, are presumed to be normally distributed as previously
discussed. The usual unbiased estimators for the mean and
variance of the severity are therefore used after the risk model
has been constructed in order to estimate the parameters of the
aggregate loss distributions:

$$\hat{\mu} = \frac{1}{k} \sum_{i=1}^{k} s_i , \qquad (7)$$

$$\hat{\sigma}^2 = \frac{1}{k-1} \sum_{i=1}^{k} \left( s_i - \hat{\mu} \right)^2 . \qquad (8)$$




It is important to note that only fully settled claims
should be considered when applying Equations (5) through (8). The severity
fields of unsettled claims are often used to record reserve 
amounts; that is, the money that insurers hold aside to cover 
pending claims. Reserve amounts are not actual losses, nor should 
they be used to develop models for predicting actual losses. 
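
The severity estimators can be sketched in Python as shown below,
with open claims filtered out so that reserve amounts never enter the
estimates. The function name `severity_estimates`, the
(amount, is_settled) input layout, and the use of the k - 1 divisor
for the log-scale variance are assumptions of this example.

```python
import math
import statistics


def severity_estimates(claims):
    """Estimate severity parameters from (amount, is_settled) pairs.

    Only settled claims are used; open claims may hold reserve amounts.
    Requires at least two settled claims.
    """
    settled = [amount for amount, is_settled in claims if is_settled]
    logs = [math.log(s) for s in settled]
    return {
        "mu_log": statistics.fmean(logs),
        "var_log": statistics.variance(logs),      # divisor k - 1 (assumed)
        "mean": statistics.fmean(settled),
        "variance": statistics.variance(settled),  # divisor k - 1
    }


claims = [(100.0, True), (200.0, True), (400.0, True), (9999.0, False)]
estimates = severity_estimates(claims)
```

The open claim with the 9999.0 reserve amount has no effect on any of
the four estimates.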

As mentioned earlier, negative log-likelihoods are 
calculated for each database record in a risk group. The 
nonconstant terms in the negative log-likelihoods are then summed 
and used as the criterion for selecting splitting factors in the 
top-down identification of risk groups. The constant terms, on 






the other hand, do not contribute to the selection of splitting 
factors and are omitted to avoid unnecessary computation. 

The negative log-likelihood of database record $i$ is given by

$$-\log[g(t_i)\, g(s_i)] = -\log[g(t_i)] - \log[g(s_i)] ,$$

where $g(t_i)$ and $g(s_i)$ are the probabilities/densities assigned to
record $i$ for the earned exposure $t_i$ and the severity $s_i$,
respectively, of record $i$. From the discussion above,

$$-\log[g(t_i)] = \begin{cases}
\lambda t_i , & \text{for non-claim records} \\
\lambda t_i - \log(\lambda) , & \text{for claim records}
\end{cases} \qquad (9)$$

and

$$-\log[g(s_i)] = \begin{cases}
0 , & \text{for non-claim records} \\
\log(\sqrt{2\pi}\, s_i) + \log(\sigma_{\log}) + \dfrac{(\log(s_i) - \mu_{\log})^2}{2 \sigma_{\log}^2} , & \text{for settled claim records.}
\end{cases} \qquad (10)$$



To obtain the optimization criterion, the first thing to
note about the above equations is that, when summed over all risk
groups, the $\log(\sqrt{2\pi}\, s_i)$ terms obtained from Equation (10) are
constant across all possible ways of dividing policyholders into
risk groups. The values of these terms depend only on the
severities of the claims and not on the parameters of the risk
groups. These terms can therefore be dropped for optimization
purposes. After removing constant terms, the negative
log-likelihood of severity for settled claim records becomes

$$\log(\sigma_{\log}) + \frac{(\log(s_i) - \mu_{\log})^2}{2 \sigma_{\log}^2} .$$

The second thing to note in deriving an optimization
criterion is that, in the case of open claim records, the value
of the settled claim amount in the above formula is unknown.
However, open claim records are still highly relevant with regard
to selecting splitting factors; in particular, they are used in
Equation (9) for calculating the negative log-likelihood of the
earned exposure. For open claim records, an estimated negative
log-likelihood of severity is therefore calculated by taking the
expected value of the formula for settled claim records. The
expected value is

$$\log(\sigma_{\log}) + \frac{1}{2} ,$$

which, after removing constant terms, reduces to

$$\log(\sigma_{\log}) .$$

When combined, the above formulas yield the following
equation for the negative log-likelihood of database record $i$
with constant terms removed:

$$\xi_i = \begin{cases}
\lambda t_i , & \text{for non-claim records} \\
\lambda t_i + \log\!\left( \dfrac{\sigma_{\log}}{\lambda} \right) , & \text{for open claim records} \\
\lambda t_i + \log\!\left( \dfrac{\sigma_{\log}}{\lambda} \right) + \dfrac{(\log(s_i) - \mu_{\log})^2}{2 \sigma_{\log}^2} , & \text{for settled claim records.}
\end{cases} \qquad (11)$$

Equation (11) can be thought of as the score of the $i$'th database
record. If the records for a risk group contain $k$ settled claims
and $l$ open claims, then the sum of these scores is given by

$$\xi = \lambda \sum_{i=1}^{N} t_i + (k+l) \log\!\left( \frac{\sigma_{\log}}{\lambda} \right) + \frac{1}{2 \sigma_{\log}^2} \sum_{i=1}^{k} \left( \log(s_i) - \mu_{\log} \right)^2 , \qquad (12)$$

where $N$ is the total number of database records for the risk
group, the first $k$ of which are assumed for convenience to be the
settled claim records. The score of the overall risk model is
obtained by summing Equation (12) over all risk groups. Risk
models are constructed by minimizing this overall score in a
stepwise fashion, where each step involves dividing a larger
risk group into two or more smaller risk groups so as to reduce
the value of the overall score to the maximum extent possible.
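
To make the use of the score concrete, the following Python sketch
evaluates the risk-group score with the parameters replaced by their
estimates and compares a pooled group with a candidate split. The
function name `group_score`, the k - 1 divisor for the log-scale
variance, and the toy data are assumptions made for this example.

```python
import math
import statistics


def group_score(exposures, settled_amounts, n_open):
    """Risk-group score (sum of record scores) using estimated parameters.

    Requires at least two settled claims with differing amounts.
    """
    k = len(settled_amounts)
    n_claims = k + n_open
    total_exposure = sum(exposures)
    rate = n_claims / total_exposure            # estimated frequency
    logs = [math.log(s) for s in settled_amounts]
    mu_log = statistics.fmean(logs)
    var_log = statistics.variance(logs)         # divisor k - 1 (assumed)
    squared = sum((x - mu_log) ** 2 for x in logs)
    return (rate * total_exposure
            + n_claims * math.log(math.sqrt(var_log) / rate)
            + squared / (2.0 * var_log))


amounts = [100.0, 200.0, 400.0]
low_risk = group_score([25.0] * 4, amounts, 0)   # 3 claims in 100 units
high_risk = group_score([2.5] * 4, amounts, 0)   # 3 claims in 10 units
pooled = group_score([25.0] * 4 + [2.5] * 4, amounts * 2, 0)
split = low_risk + high_risk
```

Because the two subgroups have very different claim frequencies, the
split yields a lower total score than the pooled group, so a
splitting factor that separates them would be favored.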

In addition to supplying numerical criteria for selecting
risk factors, such as Equation (12), an appropriate test for
actuarial credibility must also be provided to reduce the
invention to a particularized method for constructing risk models.
Actuarial credibility (see, for example, Klugman et al. above)
has to do with the accuracy of the estimated risk parameters (in
this case, frequency, severity, and ultimately pure premium).
Accuracy is measured in terms of statistical confidence
intervals; that is, how far the estimated risk parameters can
deviate from their true values and with what probability. A fully
credible estimate is an estimate that has a sufficiently small
confidence interval. In particular, estimated parameter values $X$
must be within a certain fraction $r$ of their true (i.e.,
expected) values $E[X]$ with probability at least $p$:

$$P\left\{ \frac{|X - E[X]|}{E[X]} \le r \right\} \ge p .$$

Typical choices of $r$ and $p$ used by actuaries are $r = 0.05$ and $p = 0.9$.
In other words, $X$ must be within 5% of $E[X]$ with 90% confidence.




The above credibility constraint can be converted to an
equivalent, more convenient constraint on the variance of $X$. For
any combination of values for $r$ and $p$, there exists a value for $r'$
such that

$$P\left\{ \frac{|X - E[X]|}{E[X]} \le r \right\} \ge p \quad \text{if and only if} \quad \frac{\sqrt{\mathrm{Var}[X]}}{E[X]} \le r' .$$

The value of $r'$ is essentially the maximum allowed fractional
standard error of $X$. For example, if $p = 0.9$ and $X$ has a Gaussian
distribution, then the 90% confidence interval for $X$ is ±1.645
times the standard deviation of $X$ centered about its mean. If in
addition $r = 0.05$, then

$$r' = \frac{r}{1.645} = \frac{0.05}{1.645} = 0.0304 . \qquad (13)$$

Thus, $X$ will be within 5% of $E[X]$ with 90% confidence provided
the standard error of $X$ is within 3.04% of $E[X]$.
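
For other choices of $r$ and $p$, the same conversion can be computed
from the standard normal quantile. The Python sketch below assumes a
Gaussian estimate, as in the example above; the function name
`max_fractional_standard_error` is hypothetical.

```python
from statistics import NormalDist


def max_fractional_standard_error(r, p):
    """Largest fractional standard error r' meeting the (r, p) constraint.

    Assumes the estimate is Gaussian, so the two-sided p-level interval
    is +/- z standard deviations with z the (1 + p) / 2 normal quantile.
    """
    z = NormalDist().inv_cdf(0.5 + p / 2.0)
    return r / z


r_prime = max_fractional_standard_error(0.05, 0.9)
```

With $r = 0.05$ and $p = 0.9$ the quantile is approximately 1.645, so the
result agrees with Equation (13) to the precision shown there.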



To ensure that actuarially credible risk groups are
constructed, a limit can be placed on the maximum fractional
standard error for the estimated pure premiums of each risk
group. The method of subdividing larger risk groups into smaller
risk groups will then ensure that this constraint will be obeyed
by all of the risk groups that are produced. Actuarial
credibility is thus ensured. The ability to impose actuarial
credibility constraints on the top-down process by which risk
groups are constructed is another important feature that
distinguishes the present invention from all other tree-based
modeling methods.

To derive equations for the credibility constraint, let us
ignore the issue of open claims for the moment and simply suppose
that $K$ claims are filed by a given risk group with severities
$s_1, \ldots, s_K$. Suppose further that $T$ is the total earned exposure over
which these observations were made. Then the estimated pure
premium $X$ for the risk group is given by

$$X = \frac{1}{T} \sum_{i=1}^{K} s_i .$$

As is usually the case in practice, $T$ is assumed to be given
while $K$ and $s_1, \ldots, s_K$ are random variables. The expected value of
the pure premium estimate given $K$ is simply

$$E[X \mid K] = E\left[ \frac{1}{T} \sum_{i=1}^{K} s_i \,\middle|\, K \right] = \frac{1}{T} \sum_{i=1}^{K} E[s_i] = \frac{1}{T} \sum_{i=1}^{K} \mu = \frac{K \mu}{T} ,$$

where $\mu$ is the mean severity. The (unconditional) expected pure
premium is thus

$$E[X] = E_K\big[ E[X \mid K] \big] = E\left[ \frac{K \mu}{T} \right] = \frac{\lambda T \mu}{T} = \lambda \mu ,$$


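where $\lambda$ is the claim frequency. Similarly, the expected square of
the pure premium estimate given $K$ is

$$\begin{aligned}
E[X^2 \mid K] &= E\left[ \left( \frac{1}{T} \sum_{i=1}^{K} s_i \right)^2 \,\middle|\, K \right] \\
&= \frac{1}{T^2} \left( \sum_{i=1}^{K} E[s_i^2] + \sum_{i=1}^{K} \sum_{j \ne i} E[s_i]\, E[s_j] \right) \\
&= \frac{1}{T^2} \left( K (\sigma^2 + \mu^2) + K (K - 1) \mu^2 \right) \\
&= \frac{K \sigma^2 + K^2 \mu^2}{T^2} ,
\end{aligned}$$

where $\sigma^2$ is the variance of severity. The (unconditional)
expected square of the pure premium estimate is thus

$$E[X^2] = E_K\big[ E[X^2 \mid K] \big] = E\left[ \frac{K \sigma^2 + K^2 \mu^2}{T^2} \right] = \frac{\lambda T \sigma^2 + (\lambda T + \lambda^2 T^2) \mu^2}{T^2} = \frac{\lambda (\sigma^2 + \mu^2)}{T} + \lambda^2 \mu^2 .$$

Combining the above equations yields the following formula for
the fractional standard error of the pure premium estimate:

$$\frac{\sqrt{\mathrm{Var}[X]}}{E[X]} = \frac{\sqrt{E[X^2] - E[X]^2}}{E[X]} = \sqrt{ \frac{1}{\lambda T} \left( 1 + \frac{\sigma^2}{\mu^2} \right) } . \qquad (14)$$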



In the above equation, note that $\lambda T$ is the expected number
of claims filed by the policyholders in a risk group given a
total earned exposure of $T$. An upper bound on the fractional
standard error of the pure premium estimate can thus be expressed
in terms of an equivalent lower bound on the expected number of
claims filed:

$$\frac{\sqrt{\mathrm{Var}[X]}}{E[X]} \le r' \quad \text{if and only if} \quad \lambda T \ge \frac{1}{(r')^2} \left( 1 + \frac{\sigma^2}{\mu^2} \right) .$$

For example, setting $r'$ to the value given in Equation (13)
yields the lower bound typically used by actuaries when
constructing risk models for property and casualty insurance
(see, for example, Klugman et al. above):

$$\lambda T \ge 1082 \left( 1 + \frac{\sigma^2}{\mu^2} \right) .$$
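
The lower bound on the expected number of claims can be computed
directly from the fractional standard error limit and the severity
moments. The Python sketch below is illustrative only; the function
name `min_expected_claims` and the sample inputs are assumptions.

```python
def min_expected_claims(r_prime, mean, variance):
    """Smallest expected claim count for which the fractional standard
    error of the pure premium estimate does not exceed r_prime."""
    return (1.0 + variance / mean ** 2) / r_prime ** 2


# Constant severity (zero variance) isolates the frequency requirement.
frequency_only = min_expected_claims(0.05 / 1.645, 1.0, 0.0)

# Severity with a coefficient of variation of one doubles the requirement.
with_severity = min_expected_claims(0.05 / 1.645, 1.0, 1.0)
```

With $r' = 0.05/1.645$ and no severity variation, the bound is roughly
1082 claims, and it grows in proportion to one plus the squared
coefficient of variation of severity.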



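To generalize Equation (14) to the case in which some of the
claims are not fully settled, it is useful to note that, when
Equation (14) is applied in practice, $\lambda$, $\mu$, and $\sigma$ are replaced
with their estimated values $\hat{\lambda}$, $\hat{\mu}$, and $\hat{\sigma}$ given by Equations (2),
(7), and (8), respectively:

$$\frac{\sqrt{\mathrm{Var}[X]}}{E[X]} \approx \sqrt{ \frac{1}{\hat{\lambda} T} \left( 1 + \frac{\hat{\sigma}^2}{\hat{\mu}^2} \right) } = \sqrt{ \frac{1}{k} + \frac{1}{\hat{\mu}^2} \cdot \frac{\hat{\sigma}^2}{k} } .$$

For the moment, the number of open claims $l$ is assumed to be
zero. In the above expression, the $1/k$ term is the fractional
standard error squared of the frequency estimate, as evidenced by
the fact that

$$\frac{\mathrm{Var}[\hat{\lambda}]}{E[\hat{\lambda}]^2} = \frac{\lambda / T}{\lambda^2} = \frac{1}{\lambda T} \approx \frac{1}{k} .$$

The $\hat{\sigma}^2 / k$ term is the squared standard error of the estimated mean $\hat{\mu}$, so
that $\hat{\sigma}^2 / (\hat{\mu}^2 k)$ is the fractional standard error squared of the
severity estimate.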



When the number of open claims $l$ is greater than zero, the
fractional standard error squared of the severity estimate
remains the same because only settled claims are used to estimate
severity; however, the fractional standard error squared of the
frequency estimate now becomes

$$\frac{\mathrm{Var}[\hat{\lambda}]}{E[\hat{\lambda}]^2} = \frac{1}{\lambda T} \approx \frac{1}{k+l} .$$

An appropriate generalization of Equation (14) that accounts for
open claim records is therefore

$$\frac{\sqrt{\mathrm{Var}[X]}}{E[X]} \approx \sqrt{ \frac{1}{k+l} + \frac{1}{k} \cdot \frac{\hat{\sigma}^2}{\hat{\mu}^2} } .$$

The above equation can be further specialized by making use of
the fact that, for log-normal distributions,

$$\frac{\sigma^2}{\mu^2} = e^{\sigma_{\log}^2} - 1 .$$

The credibility constraint for joint Poisson/log-normal models
therefore simplifies to

$$\frac{\sqrt{\mathrm{Var}[X]}}{E[X]} \approx \sqrt{ \frac{1}{k+l} + \frac{1}{k} \left( e^{\hat{\sigma}_{\log}^2} - 1 \right) } \le r' . \qquad (15)$$
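
A credibility test of this form might be coded as in the Python
sketch below, which takes the settled and open claim counts of a
risk group together with its estimated log-severity variance. The
function names and the sample thresholds are assumptions of this
example.

```python
import math


def fractional_standard_error(n_settled, n_open, var_log):
    """Approximate fractional standard error of the pure premium estimate
    for a joint Poisson/log-normal risk group."""
    frequency_term = 1.0 / (n_settled + n_open)
    severity_term = math.expm1(var_log) / n_settled  # (e^var_log - 1) / k
    return math.sqrt(frequency_term + severity_term)


def is_credible(n_settled, n_open, var_log, r_prime):
    """True if the risk group satisfies the credibility constraint."""
    return fractional_standard_error(n_settled, n_open, var_log) <= r_prime


small_group = is_credible(1000, 0, 0.0, 0.0304)
large_group = is_credible(1083, 0, 0.0, 0.0304)
```

During top-down construction, a candidate split would be rejected if
any of the resulting risk groups failed this test.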



For computational reasons, it is often desirable to identify 
risk groups based on accident-enriched, stratified training data 
sets. Stratified sampling has the benefit of reducing the amount 
of computation that is required, but it biases the values of the 
model parameters that are estimated for each risk group. After a 
risk model is constructed on the basis of stratified data, a 
postprocessing step is recommended to obtain unbiased estimates 
of the model parameters on a separate, unstratified calibration 
data set. 

When setting the value of $r'$ in such cases, actuaries will
likely be thinking in terms of the maximum fractional standard
error of pure premium after calibration is performed. However,
the constraint is applied only during training. Assuming that the
resulting risk groups are indeed credible, we would expect the
ratios of training claim records to calibration claim records to
be approximately the same across risk groups. In other words, we
would expect that, for any given risk group,

$$\frac{k_{\text{training}}}{k_{\text{calibration}}} \approx \frac{(k+l)_{\text{training}}}{(k+l)_{\text{calibration}}} \approx \frac{K_{\text{training}}}{K_{\text{calibration}}} ,$$

where $K_{\text{training}}$ and $K_{\text{calibration}}$ are the total numbers of claim records in
the training and calibration data sets, respectively. We would
likewise expect that the variance of the log severities of each
risk group remains about the same when estimated on training data
versus calibration data. These approximate equalities suggest
that, if Equation (15) holds on the calibration data, then the
following relationship should likewise hold on the training data:

$$r'_{\text{calibration}} \ge \sqrt{ \frac{1}{(k+l)_{\text{calibration}}} + \frac{1}{k_{\text{calibration}}} \left( e^{\hat{\sigma}_{\log}^2} - 1 \right) } \approx \sqrt{ \frac{K_{\text{training}}}{K_{\text{calibration}}} \left( \frac{1}{(k+l)_{\text{training}}} + \frac{1}{k_{\text{training}}} \left( e^{\hat{\sigma}_{\log}^2} - 1 \right) \right) } .$$



Assuming the above relationship does indeed hold, the appropriate
credibility constraint for the training data would therefore be

$$r'_{\text{calibration}} \sqrt{ \frac{K_{\text{calibration}}}{K_{\text{training}}} } \ge \sqrt{ \frac{1}{(k+l)_{\text{training}}} + \frac{1}{k_{\text{training}}} \left( e^{\hat{\sigma}_{\log}^2} - 1 \right) } .$$

The above constraint motivates the following equation:

$$r'_{\text{training}} = r'_{\text{calibration}} \sqrt{ \frac{K_{\text{calibration}}}{K_{\text{training}}} } = r'_{\text{calibration}} \sqrt{ \frac{\text{Number of Claim Records in the Calibration Set}}{\text{Number of Claim Records in the Training Set}} } . \qquad (16)$$

Equation (16) provides a method for determining the value of $r'_{\text{training}}$
to use on the training data during model construction given the
desired value of $r'_{\text{calibration}}$ to be achieved on the calibration data.
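
The adjustment is a one-line computation. In the Python sketch
below, the function name `training_threshold` and the claim counts
are hypothetical.

```python
import math


def training_threshold(r_calibration, n_claims_calibration, n_claims_training):
    """Fractional standard error limit to apply on the training data so
    that the desired limit is met on the calibration data."""
    return r_calibration * math.sqrt(n_claims_calibration / n_claims_training)


# Training set holds a quarter as many claim records as the calibration set.
r_training = training_threshold(0.0304, 40000, 10000)
```

Because the training set in this example has one quarter of the claim
records, the limit applied during training is twice the limit desired
after calibration.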



The joint Poisson/log-normal model presented above is
suitable for constructing new risk models from scratch
independent of existing risk models or price structures. However,
in practice, actuaries often face a somewhat different problem:
that of diagnosing an existing risk model or price structure to
identify risk factors that are being overlooked and that would
significantly affect the premiums of various policyholders if
they were taken into account. For example, an insurer might have
a book of business that is performing below expectations (the
book might even be losing money) but no one knows why. One could
approach this problem by constructing an entirely new risk model
from scratch and developing a new price structure accordingly;
however, that would still not identify the cause(s) of the
existing problem. Moreover, tremendous amounts of time and effort
are often required to obtain approvals from state insurance
regulators before new price structures can be put into effect. In
the meantime, an insurer could be losing money, market share, or
both.

A desirable alternative is to identify the source of the
problem and devise small changes to the existing price structure
to fix the problem. Small changes are more likely to receive
quick approval from state regulators. In addition, they are more
likely to have minimal impact on the majority of policyholders,
which is an important consideration from the point of view of
customer satisfaction. Even in cases where a new price structure
might in fact be warranted, it is still desirable to identify a
quick fix in order to remain solvent while a new structure is
being developed and approved.

The joint Poisson/log-normal model in the form presented
above is not sufficient to address this problem directly because
it does not take existing prices into consideration. However, the
model can be modified to produce a more sophisticated statistical
model that takes an insurer's existing price structure into
account in order to model the profitability of policyholders. The
numerical criteria derived from this more sophisticated
statistical model would then cause the method for dividing larger
risk groups into smaller risk groups to segment policyholders
according to their levels of profitability. In the process, risk
factors would be identified that distinguish the most profitable
policyholders from the least profitable ones. Thus, the resulting
particularized expression of the invention would explicitly
search for and identify risk factors that are not already taken
into account in an insurer's existing price structure.



Insurers typically measure profitability in terms of loss
ratio, which is the ratio of total claims paid over total
premiums collected, or equivalently the ratio of estimated pure
premium over the average premium charged per unit time:

$$\begin{aligned}
\text{Loss Ratio} &= \frac{\text{Incurred Claims}}{\text{Earned Premiums}} \\
&= \frac{\text{Incurred Claims}}{\text{Earned Exposure}} \cdot \frac{\text{Earned Exposure}}{\text{Earned Premiums}} \\
&= \frac{\text{Estimated Pure Premium}}{\text{Average Premium Charged per Unit Time}} .
\end{aligned}$$

Loss ratio is related to gross profit margin by the following
equation:

$$\begin{aligned}
\text{Gross Profit Margin} &= \frac{\text{Earned Premiums} - \text{Incurred Claims}}{\text{Earned Premiums}} \\
&= 1 - \text{Loss Ratio} .
\end{aligned}$$
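
The equivalence of the two loss-ratio forms can be checked with a
few lines of Python. The function names and the dollar and exposure
figures below are made up for illustration.

```python
import math


def loss_ratio(incurred_claims, earned_premiums):
    """Loss ratio as total claims over total premiums."""
    return incurred_claims / earned_premiums


def loss_ratio_from_rates(incurred_claims, earned_premiums, earned_exposure):
    """Loss ratio as pure premium over average premium per unit time."""
    pure_premium = incurred_claims / earned_exposure
    average_premium = earned_premiums / earned_exposure
    return pure_premium / average_premium


direct = loss_ratio(650_000.0, 1_000_000.0)
via_rates = loss_ratio_from_rates(650_000.0, 1_000_000.0, 8_000.0)
gross_profit_margin = 1.0 - direct
```

The earned exposure cancels, so both forms give the same loss ratio
regardless of the time unit chosen.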



Although loss ratios can be reliably estimated only for
entire risk groups or books of business, the relationship between
loss ratio and pure premium defined above does permit loss ratios
of individual policyholders to be defined in the following manner:

$$L = \frac{\lambda \mu}{p} ,$$

where $L$ is the loss ratio of an individual policyholder, $p$ is the
premium charged to the policyholder per unit time, and $\lambda$ and $\mu$
are the frequency and mean severity parameters, respectively,
that are ascribed to the policyholder by a suitable statistical
model. Risk models are traditionally used to ascribe frequency
and mean severity parameters to individual policyholders. To
segment policyholders by loss ratio instead of by risk, one
simply needs to change the nature of the statistical models that
are used to ascribe frequencies and severities to individuals. To
motivate the changes that are required, it is useful to examine
the differences between risk modeling and loss ratio modeling in
more detail.

When developing new price structures for property and casualty
insurance, risk models are first constructed that divide
policyholders into homogeneous risk groups according to their
frequency and severity characteristics. Appropriate premiums are
then determined for these risk groups in order to achieve desired
loss-ratio targets. However, when diagnosing existing price
structures, the premiums have already been decided and we are not
interested in developing yet another risk model. Instead, we want
to group policyholders into population segments according to
their actual loss ratio performance. In other words, each segment
might include policyholders from several known risk groups, but
all policyholders in a segment should exhibit the same loss
ratio. The definitions of the segments would then define risk
characteristics that are not reflected in the price structure.

In order to segment policyholders according to their 
individual loss ratios, we develop mathematical models that allow 
different policyholders within a segment to have different 
frequency and mean severity parameters, and yet require all 
policyholders within a segment to have the same loss ratio. Such 
models can be constructed by treating the frequency and mean 
severity parameters of individual policyholders as functions of 
the premiums they are charged. Each segment would have its own 
frequency and severity functions. Within each segment, the 
frequency and severity functions would be constrained by the loss 
ratio of the segment so as to obey the following equation: 

$$\lambda(p) \cdot \mu(p) = L \cdot p , \qquad (17)$$

where $L$ is the loss ratio of the segment, and where $\lambda(p)$ and $\mu(p)$
are the functions that define the frequency and mean severity
parameters, respectively, of the policyholders in the segment as
a function of the premiums $p$ that are charged to the
policyholders per unit time.

The importance of Equation (17) is that it enables 
appropriate statistical models to be readily constructed for the 
purpose of segmenting policyholders by loss ratio. In particular, 
the same frequency-severity models used to construct risk models
can be adapted for use in loss-ratio modeling by simply replacing
the frequency and mean severity parameters that appear in those
models with frequency and mean severity functions that satisfy
Equation (17). For example, the frequency and mean severity
parameters could be replaced with functions of the forms

$$\lambda(p) = \alpha \cdot p^{q} \qquad (18)$$

and

$$\mu(p) = \beta \cdot p^{1-q} , \qquad (19)$$

respectively, where $\alpha$, $\beta$, and $q$ are function parameters. These
function parameters could then be estimated using maximum
likelihood techniques in much the same manner as standard
frequency and severity parameters. Once values for the function
parameters have been determined, the loss ratio of a segment
would be given by the product of the $\alpha$ and $\beta$ parameters for that
segment: $L = \alpha \cdot \beta$.
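
The following Python sketch illustrates that power-law frequency and
mean severity functions of this kind give every policyholder in a
segment the same loss ratio, whatever premium is charged. The
function names and the parameter values are assumptions chosen for
the example.

```python
import math


def frequency(p, alpha, q):
    """Claim frequency as a function of premium p per unit time."""
    return alpha * p ** q


def mean_severity(p, beta, q):
    """Mean severity as a function of premium p per unit time."""
    return beta * p ** (1.0 - q)


alpha, beta, q = 0.002, 300.0, 0.3   # hypothetical segment parameters
loss_ratios = [
    frequency(p, alpha, q) * mean_severity(p, beta, q) / p
    for p in (100.0, 250.0, 1000.0)
]
```

Each entry of `loss_ratios` equals the product of `alpha` and `beta`
up to rounding, because the exponents of $p$ sum to one.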

Equations (18) and (19) constitute what is perhaps the 
simplest class of parametric functions that satisfy 
Equation (17) . They are used below as the basis for developing a 
loss-ratio model from the joint Poisson/log-normal model 
presented above. It is also possible to use more elaborate 
functions, but not without a significant increase in the 
complexity of the parameter estimation problem. 

For other forms of insurance involving risk parameters other 
than frequency and severity, the same general approach can be 
employed to convert statistical models of insurance risk for use 
in profitability modeling. All that is required is to replace the 
risk parameters in these models with parametric functions of 
premium charged that are analogous to Equations (18) and (19).
Estimation techniques similar to those presented below for the
joint Poisson/log-normal model would then be used to estimate the
parameters of these functions and to calculate the numerical
criteria (e.g., negative log-likelihood criteria) needed to
identify splitting factors.



When Equation (18) is introduced into the joint
Poisson/log-normal model, the negative log-likelihood of the
earned exposure $t_i$ of database record $i$ as defined by Equation (9)
becomes

$$-\log[g(t_i)] = \begin{cases}
\alpha\, p_i^{q}\, t_i , & \text{for non-claim records} \\
\alpha\, p_i^{q}\, t_i - \log(\alpha) - q \log(p_i) , & \text{for claim records,}
\end{cases} \qquad (20)$$

where

$$p_i = \text{Average Premium Charged per Unit Time for Record } i = \frac{\text{Earned Premium of Record } i}{\text{Earned Exposure of Record } i} .$$
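
As a sketch of how the modified exposure term might be evaluated,
the Python function below computes the negative log-likelihood of
one record's earned exposure when frequency is a power-law function
of premium. The function name `exposure_nll` and the argument values
are assumptions of this example.

```python
import math


def exposure_nll(exposure, premium_rate, alpha, q, is_claim):
    """Negative log-likelihood of a record's earned exposure.

    premium_rate is the earned premium divided by the earned exposure.
    """
    rate = alpha * premium_rate ** q      # frequency as a function of premium
    nll = rate * exposure
    if is_claim:
        nll -= math.log(alpha) + q * math.log(premium_rate)
    return nll


nonclaim = exposure_nll(2.0, 500.0, 0.1, 0.0, False)
claim = exposure_nll(3.0, 400.0, 0.001, 0.5, True)
```

With $q = 0$ the premium drops out and the function reduces to the
constant-frequency form of Equation (9).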

To calculate the negative log-likelihood of severity,
Equation (19) must first be transformed into an equivalent
formula for mean log severity before it can be substituted into
Equation (10). For log-normal distributions,

$$\mu_{\log} = \log(\mu) - \frac{1}{2} \sigma_{\log}^2 .$$

Thus, from Equation (19),

$$\begin{aligned}
\mu_{\log}(p) &= \log\!\left( \beta\, p^{1-q} \right) - \frac{1}{2} \sigma_{\log}^2 \\
&= \log(\beta) + (1 - q) \log(p) - \frac{1}{2} \sigma_{\log}^2 \qquad (21) \\
&= (1 - q) \log(p) + \left[ \log(\beta) - \frac{1}{2} \sigma_{\log}^2 \right] .
\end{aligned}$$

In general, the variance of the log severity $\sigma_{\log}^2$ could also
be treated as a function of the premium charged. However, the
value of this variance was found to be fairly constant across
risk groups in one examination we made of insurance data.
Therefore, for the sake of simplicity, we will assume that the
variance is independent of the premium charged, enabling $\sigma_{\log}^2$ to
be treated as a straightforward parameter.

Because, /? and a\ 0% are both parameters, Equation (21) can be 
reparameterized as follows in terms of a new parameter y 

t*\og(p) = (1 - q) logKp) + y , (22 ) 

where 

y= iogO?)-K ' < 23 > 

which implies that 

p = e g . (24) 

This reparameterization turns out to be more convenient from the 
point of view of obtaining maximum likelihood estimates of all 
parameters. Because Equations (23) and (24) define a one-to-one 
mapping between y and /? given a value for of og , there is a 
one-to-one mapping between the original set of parameters 
<a,#0f ogJ <7> and the new set of parameters (a, y, af og) q) . Consequently, 
estimating the new parameters using maximum-likelihood techniques 
and then calculating f$ using Equation (24) yields the same 
maximum-likelihood estimates of a, /?, a\ 0% and q that would be 
obtained using direct techniques. 
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m 




Given this reparameterization, the equation for the negative 
log-likelihood of the severity s, of the /"'th database record can 
be obtained by substituting Equation (22) into Equation (10) : 



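$$
-\log[f(s_i)] =
\begin{cases}
0, & \text{for non-claim records} \\
\log(\sqrt{2\pi}\,s_i) + \log(\sigma_{\log})
  + \dfrac{\left(\log(s_i) - (1-q)\log(p_i) - \gamma\right)^2}{2\sigma_{\log}^2},
  & \text{for settled claim records.}
\end{cases}
\tag{25}
$$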



As before, the $\log(\sqrt{2\pi}\,s_i)$ term is independent of the values of the
model parameters. When summed over all database records, the
value of this term is constant over all ways of segmenting
policyholders into separate groups. The term can therefore be
dropped for the purpose of optimizing the segmentation. After
removing constant terms, the negative log-likelihood of severity
for settled claim records therefore becomes

$$
\log(\sigma_{\log}) + \frac{\left(\log(s_i) - (1-q)\log(p_i) - \gamma\right)^2}{2\sigma_{\log}^2} .
$$



In the case of open claim records, the value of the severity
$s_i$ in the above formula is unknown. As before, an estimated
negative log-likelihood of severity is therefore calculated.
However, rather than using the expected value of the negative
log-likelihood of severity for settled claims as the estimate for
open claims, the average value for the settled claim records is
used instead:

$$
\log(\sigma_{\log}) + \frac{\frac{1}{k}\sum_{j=1}^{k}\left(\log(s_j) - (1-q)\log(p_j) - \gamma\right)^2}{2\sigma_{\log}^2} ,
$$

where the first $k$ database records for the segment in the
training data are assumed for convenience to be the settled claim
records. This particular estimate has the desirable benefit that
the maximum likelihood estimates of $\gamma$ and $\sigma_{\log}^2$ (and, hence, $\beta$)
depend only on the settled claim records and are not affected by
open claims, as demonstrated below in Equations (29) and (30).
Note that, because $\beta$ depends on $\gamma$ and $\sigma_{\log}^2$ as per Equation (24),
both $\gamma$ and $\sigma_{\log}^2$ must be estimated using maximum likelihood
techniques in order to obtain a maximum likelihood estimate of $\beta$.
The usual unbiased estimator for variance is not appropriate in
this instance.



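Combining the above formulas with Equation (20) yields the
following equation for the negative log-likelihood of database
record $i$ with constant terms removed:

$$
\xi_i =
\begin{cases}
t_i\,\alpha\,(p_i)^q, & \text{for non-claim records} \\[4pt]
t_i\,\alpha\,(p_i)^q - q\log(p_i) + \log\!\left(\dfrac{\sigma_{\log}}{\alpha}\right)
  + \dfrac{\frac{1}{k}\sum_{j=1}^{k}\left(\log(s_j) - (1-q)\log(p_j) - \gamma\right)^2}{2\sigma_{\log}^2},
  & \text{for open claim records} \\[4pt]
t_i\,\alpha\,(p_i)^q - q\log(p_i) + \log\!\left(\dfrac{\sigma_{\log}}{\alpha}\right)
  + \dfrac{\left(\log(s_i) - (1-q)\log(p_i) - \gamma\right)^2}{2\sigma_{\log}^2},
  & \text{for settled claim records.}
\end{cases}
\tag{26}
$$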



It is important to keep in mind that, in the subformula of
Equation (26) that pertains to open claim records, only settled
claim records in the training data enter into the normalized sum
that appears in the numerator of the fourth term. As shown below,
this numerator is in fact the maximum likelihood estimate of $\sigma_{\log}^2$.
When the above formula is applied to a separate validation data
set after all parameters have been estimated on the training
data, the fourth term therefore reduces to a constant value of
1/2.



Maximum likelihood parameter estimates are obtained by
choosing parameter values that minimize the sum of Equation (26)
over all database records in the training set. If the training
data for a segment consists of $N$ database records, the first $k$ of
which represent settled claims and the next $l$ represent open
claims, then the sum of Equation (26) is given by

$$
\xi = \alpha\left[\sum_{i=1}^{N} t_i (p_i)^q\right]
    - q\left[\sum_{i=1}^{k+l} \log(p_i)\right]
    + (k+l)\log\!\left(\frac{\sigma_{\log}}{\alpha}\right)
    + \frac{k+l}{k}\cdot\frac{1}{2\sigma_{\log}^2}
      \sum_{i=1}^{k}\left(\log(s_i) - (1-q)\log(p_i) - \gamma\right)^2 .
\tag{27}
$$



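Equations for the maximum likelihood estimates of $\alpha$, $\gamma$, and
$\sigma_{\log}^2$ are readily obtained by setting the corresponding partial
derivatives of Equation (27) to zero. The partial derivative of $\xi$
with respect to $\alpha$ is given by

$$
\frac{\partial \xi}{\partial \alpha} = \sum_{i=1}^{N} t_i (p_i)^q - \frac{k+l}{\alpha} .
$$

When set to zero, the above equation yields the following maximum
likelihood estimate of $\alpha$:

$$
\hat{\alpha} = \frac{k+l}{\sum_{i=1}^{N} t_i (p_i)^q} . \tag{28}
$$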



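Similarly, the partial derivative of $\xi$ with respect to $\gamma$,

$$
\frac{\partial \xi}{\partial \gamma} =
  -\frac{k+l}{k}\cdot\frac{1}{\sigma_{\log}^2}
  \sum_{i=1}^{k}\left(\log(s_i) - (1-q)\log(p_i) - \gamma\right) ,
$$

when set to zero yields

$$
\hat{\gamma} = \frac{1}{k}\sum_{i=1}^{k}\log(s_i) - \frac{1-q}{k}\sum_{i=1}^{k}\log(p_i) , \tag{29}
$$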



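and the partial derivative of $\xi$ with respect to $\sigma_{\log}$,

$$
\frac{\partial \xi}{\partial \sigma_{\log}} =
  \frac{k+l}{\sigma_{\log}}
  - \frac{k+l}{k}\cdot\frac{1}{\sigma_{\log}^3}
    \sum_{i=1}^{k}\left(\log(s_i) - (1-q)\log(p_i) - \gamma\right)^2 ,
$$

when set to zero yields

$$
\hat{\sigma}_{\log}^2 = \frac{1}{k}\sum_{i=1}^{k}\left(\log(s_i) - (1-q)\log(p_i) - \gamma\right)^2 . \tag{30}
$$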



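When combined with Equation (29), Equation (30) reduces to

$$
\hat{\sigma}_{\log}^2 = (1-q)^2\,\sigma_{\log(p)}^2 - 2(1-q)\,\sigma_{\log(s)\log(p)} + \sigma_{\log(s)}^2 , \tag{31}
$$

where

$$
\sigma_{\log(p)}^2 = \frac{1}{k}\sum_{i=1}^{k}\log^2(p_i) - \left[\frac{1}{k}\sum_{i=1}^{k}\log(p_i)\right]^2 , \tag{32}
$$

$$
\sigma_{\log(s)\log(p)} = \frac{1}{k}\sum_{i=1}^{k}\log(s_i)\log(p_i)
  - \left[\frac{1}{k}\sum_{i=1}^{k}\log(s_i)\right]\left[\frac{1}{k}\sum_{i=1}^{k}\log(p_i)\right] , \tag{33}
$$

and

$$
\sigma_{\log(s)}^2 = \frac{1}{k}\sum_{i=1}^{k}\log^2(s_i) - \left[\frac{1}{k}\sum_{i=1}^{k}\log(s_i)\right]^2 . \tag{34}
$$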
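
As an illustration of Equations (24) and (28)-(34), the closed-form estimates for a fixed value of q might be computed as in the following minimal Python sketch. The function name, argument names, and record ordering (settled claims first, then open claims, then non-claim records) are assumptions made for this example only.

```python
import math

def estimate_given_q(q, exposures, premiums, settled_severities, n_open):
    """Closed-form estimates of alpha, gamma, sigma2_log and beta for a fixed q.

    Assumed layout (hypothetical): the first k records are settled claims,
    the next n_open are open claims, the remainder are non-claim records.
    exposures[i] is t_i and premiums[i] is the premium per unit time p_i.
    """
    k = len(settled_severities)
    log_s = [math.log(s) for s in settled_severities]
    log_p = [math.log(p) for p in premiums[:k]]
    mean_s = sum(log_s) / k
    mean_p = sum(log_p) / k
    var_p = sum(x * x for x in log_p) / k - mean_p ** 2                      # Eq. (32)
    cov_sp = sum(a * b for a, b in zip(log_s, log_p)) / k - mean_s * mean_p  # Eq. (33)
    var_s = sum(x * x for x in log_s) / k - mean_s ** 2                      # Eq. (34)
    alpha = (k + n_open) / sum(t * p ** q for t, p in zip(exposures, premiums))  # Eq. (28)
    gamma = mean_s - (1 - q) * mean_p                                        # Eq. (29)
    sigma2 = (1 - q) ** 2 * var_p - 2 * (1 - q) * cov_sp + var_s             # Eq. (31)
    beta = math.exp(gamma + 0.5 * sigma2)                                    # Eq. (24)
    return alpha, gamma, sigma2, beta
```

Only the settled claim records enter the estimates of gamma and sigma2, while open claims affect alpha through the claim count alone.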



Unfortunately, it is not possible to obtain a closed-form
solution for the maximum likelihood estimate of $q$. Instead,
Equation (27) must be minimized directly using numerical analysis
techniques in order to estimate $q$. Equations (28)-(31) can be
used to eliminate all other parameters from Equation (27). The
maximum likelihood estimate of $q$ is then determined by minimizing
the resulting equation, which is given below:

$$
\xi = (k+l)\log\!\left[\sum_{i=1}^{N} t_i (p_i)^q\right]
    - q\sum_{i=1}^{k+l}\log(p_i)
    + \frac{k+l}{2}\log\!\left[(1-q)^2\sigma_{\log(p)}^2 - 2(1-q)\sigma_{\log(s)\log(p)} + \sigma_{\log(s)}^2\right]
    + \frac{3(k+l)}{2}
    - (k+l)\log(k+l) .
$$



To minimize the above equation with respect to $q$, it is
useful to note that the fourth and fifth terms are constants and
can be dropped from consideration. It is also useful to combine
the first two terms and to normalize the equation by dividing
through by $(k+l)$. After performing these operations, the resulting
equivalent equation to be minimized is

$$
\xi' = \log\!\left[\sum_{i=1}^{N} t_i \left(\frac{p_i}{\bar{p}}\right)^q\right]
     + \frac{1}{2}\log\!\left[(1-q)^2\sigma_{\log(p)}^2 - 2(1-q)\sigma_{\log(s)\log(p)} + \sigma_{\log(s)}^2\right] ,
\tag{35}
$$

where $\bar{p}$ is the geometric mean of the premiums charged per unit
time for claim records only:

$$
\bar{p} = \left(\prod_{i=1}^{k+l} p_i\right)^{\frac{1}{k+l}}
        = e^{\frac{1}{k+l}\sum_{i=1}^{k+l}\log(p_i)} .
\tag{36}
$$



The first term of Equation (35) corresponds to the frequency
component of the negative log-likelihood of the training data,
while the second term corresponds to the severity component.
Minimizing their sum thus balances the degree of fit of Equation
(18) to the observed occurrences of claim filings against the
degree of fit of Equation (19) to the claim amounts.

The second term of Equation (35) is the logarithm of a
quadratic formula that can be efficiently computed as $q$ is varied
once $\sigma_{\log(p)}^2$, $\sigma_{\log(s)\log(p)}$ and $\sigma_{\log(s)}^2$ have been calculated. The second term
achieves its minimum value when

$$
q = 1 - \frac{\sigma_{\log(s)\log(p)}}{\sigma_{\log(p)}^2} . \tag{37}
$$

This value of $q$ is a useful starting point when minimizing
Equation (35). Note that if $\sigma_{\log(p)}^2 = 0$, then $\sigma_{\log(s)\log(p)} = 0$ and the
second term of Equation (35) becomes a constant (i.e., $\tfrac{1}{2}\log(\sigma_{\log(s)}^2)$).
Under these circumstances, the optimum value of $q$ is dictated
solely by the first term of Equation (35).
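
The objective of Equation (35) and the starting value of q that minimizes its severity term can be sketched as follows. This is a direct, unoptimized evaluation that rescans the records on every call; the function names and the assumption that the claim records come first in the lists are illustrative only.

```python
import math

def xi_prime(q, exposures, premiums, n_claims, var_p, cov_sp, var_s):
    """Equation (35), evaluated directly for one value of q.

    Assumes (hypothetically) that the first n_claims = k + l records are the
    claim records; var_p, cov_sp, var_s are the quantities of Eqs. (32)-(34).
    """
    # Geometric mean of the premiums of claim records, Eq. (36).
    p_bar = math.exp(sum(math.log(p) for p in premiums[:n_claims]) / n_claims)
    frequency_term = math.log(
        sum(t * (p / p_bar) ** q for t, p in zip(exposures, premiums)))
    quadratic = (1 - q) ** 2 * var_p - 2 * (1 - q) * cov_sp + var_s
    return frequency_term + 0.5 * math.log(quadratic)

def starting_q(var_p, cov_sp):
    """Value of q that minimizes the second (severity) term of Eq. (35) alone."""
    return 1.0 - cov_sp / var_p
```

A general-purpose one-dimensional minimizer could then be started from `starting_q(...)`; the choice of minimizer is left open here.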



Unfortunately, the first term in Equation (35) cannot be 
efficiently computed as q is varied because the relevant data 
records would have to be rescanned for each new value of q that 
is considered, and because the number of such data records is 
typically very large. However, computationally-efficient 
approximations of the first term can be constructed and 
substituted into Equation (35) for the purpose of estimating q. 
Moreover, these approximations can be made as accurate as 
desired. 



The first term in Equation (35) turns out to be very well
behaved. Its value is bounded above and below by linear functions
of $q$. In particular, let $f(q)$ be the value of the first term,

$$
f(q) = \log\!\left[\sum_{i=1}^{N} t_i \left(\frac{p_i}{\bar{p}}\right)^q\right] , \tag{38}
$$

and let $p_{\min}$ and $p_{\max}$ be the minimum and maximum values,
respectively, of $p_i$, $1 \le i \le N$. Then for $q \ge 0$,

$$
q\log\!\left(\frac{p_{\min}}{\bar{p}}\right) + \log\!\left(\sum_{i=1}^{N} t_i\right)
\;\le\; f(q) \;\le\;
q\log\!\left(\frac{p_{\max}}{\bar{p}}\right) + \log\!\left(\sum_{i=1}^{N} t_i\right) ,
$$

and for $q \le 0$,

$$
q\log\!\left(\frac{p_{\max}}{\bar{p}}\right) + \log\!\left(\sum_{i=1}^{N} t_i\right)
\;\le\; f(q) \;\le\;
q\log\!\left(\frac{p_{\min}}{\bar{p}}\right) + \log\!\left(\sum_{i=1}^{N} t_i\right) .
$$

Note that, if $p_{\min} = p_{\max}$, that is, if all policyholders in a segment
are charged the same premium per unit time, then $f(q)$ becomes a
constant:

$$
f(q) = \log\!\left(\sum_{i=1}^{N} t_i\right) .
$$

The choice of $q$ is then arbitrary. In this case, $q$ is a redundant
parameter because all policyholders will be assigned the same
frequency and mean severity by Equations (18) and (19).
Segmenting on the basis of loss ratio becomes equivalent to
segmenting on the basis of pure premium under these
circumstances.
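
The linear bounds on f(q) stated above can be checked numerically with a short sketch such as the following. The function names are hypothetical, and `p_bar` is passed in as a plain positive number rather than computed from claim records.

```python
import math

def f_first_term(q, exposures, premiums, p_bar):
    """Equation (38): f(q) = log(sum_i t_i * (p_i / p_bar)**q)."""
    return math.log(sum(t * (p / p_bar) ** q for t, p in zip(exposures, premiums)))

def linear_bounds(q, exposures, premiums, p_bar):
    """Return (lower, upper) linear bounds on f(q) for the given q.

    For q >= 0 the p_min line is the lower bound and the p_max line the
    upper bound; for q <= 0 the roles swap, which min/max handles here.
    """
    const = math.log(sum(exposures))
    line_min = q * math.log(min(premiums) / p_bar) + const
    line_max = q * math.log(max(premiums) / p_bar) + const
    return min(line_min, line_max), max(line_min, line_max)
```

When every premium is identical, both lines coincide with `log(sum(exposures))`, matching the constant case described in the text.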



In general, $p_{\min}$ will be strictly less than $p_{\max}$. In this
case, $f(q)$ is not only bounded by linear functions of $q$, it
asymptotically converges to linear functions of $q$ as the
magnitude of $q$ tends to infinity. For $q > 0$, the asymptotic
behavior of $f(q)$ is revealed by rewriting Equation (38) in the
form

$$
\begin{aligned}
f(q) &= q\log\!\left(\frac{p_{\max}}{\bar{p}}\right)
      + \log\!\left(\sum_{p_i = p_{\max}} t_i\right)
      + \log\!\left(1 + \frac{\sum_{p_i \ne p_{\max}} t_i (p_i)^q}{\sum_{p_i = p_{\max}} t_i (p_{\max})^q}\right) \\
     &= q\log\!\left(\frac{p_{\max}}{\bar{p}}\right)
      + \log\!\left(\sum_{p_i = p_{\max}} t_i\right)
      + \log\!\left(1 + e^{g(q)}\right) ,
\end{aligned}
\tag{39}
$$

where

$$
g(q) = \log\!\left(\frac{\sum_{p_i \ne p_{\max}} t_i (p_i)^q}{\sum_{p_i = p_{\max}} t_i (p_{\max})^q}\right) . \tag{40}
$$

Note that $g(q)$ is a monotonically decreasing function that tends
to negative infinity in the limit as $q$ tends to infinity. Thus,
the linear and constant terms of Equation (39) dominate for $q \gg 0$.

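For $q < 0$, the asymptotic behavior is revealed by rewriting
Equation (38) in the form

$$
\begin{aligned}
f(q) &= q\log\!\left(\frac{p_{\min}}{\bar{p}}\right)
      + \log\!\left(\sum_{p_i = p_{\min}} t_i\right)
      + \log\!\left(1 + \frac{\sum_{p_i \ne p_{\min}} t_i (p_i)^q}{\sum_{p_i = p_{\min}} t_i (p_{\min})^q}\right) \\
     &= q\log\!\left(\frac{p_{\min}}{\bar{p}}\right)
      + \log\!\left(\sum_{p_i = p_{\min}} t_i\right)
      + \log\!\left(1 + e^{h(q)}\right) ,
\end{aligned}
\tag{41}
$$

where

$$
h(q) = \log\!\left(\frac{\sum_{p_i \ne p_{\min}} t_i (p_i)^q}{\sum_{p_i = p_{\min}} t_i (p_{\min})^q}\right) . \tag{42}
$$

In this case, $h(q)$ is a monotonically increasing function that
tends to negative infinity as $q$ tends to negative infinity. Thus,
the linear and constant terms of Equation (41) dominate for $q \ll 0$.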
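
The decompositions in Equations (39)-(42) can be verified against the direct form of Equation (38) with a sketch like the one below. A single helper covers both g(q) and h(q) by taking the reference premium as an argument; that parameterization and the function names are assumptions of this example.

```python
import math

def g_or_h(q, exposures, premiums, p_ref):
    """Equations (40)/(42): pass p_ref = max(premiums) for g(q) and
    p_ref = min(premiums) for h(q). Requires at least one premium != p_ref."""
    other = sum(t * p ** q for t, p in zip(exposures, premiums) if p != p_ref)
    same = sum(t * p_ref ** q for t, p in zip(exposures, premiums) if p == p_ref)
    return math.log(other / same)

def f_decomposed(q, exposures, premiums, p_bar, p_ref):
    """Equations (39)/(41): linear term + constant term + log(1 + e^{g or h})."""
    t_ref = sum(t for t, p in zip(exposures, premiums) if p == p_ref)
    correction = math.log1p(math.exp(g_or_h(q, exposures, premiums, p_ref)))
    return q * math.log(p_ref / p_bar) + math.log(t_ref) + correction
```

As q grows large and positive (or large and negative with the minimum premium as reference), the correction term shrinks toward zero and only the linear and constant terms remain.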

The above equations enable approximations of $f(q)$ to be
constructed in terms of approximations for g(q) and h(q). The
accuracy of these approximations depends on what constraints can
be assumed for the value of q. If policyholders are being charged 
premiums that accurately reflect their true levels of risk, then 
the value of q should be close to one because frequency is highly 
correlated with pure premium. However, the value of q becomes 
less predictable if significant risk factors exist that are not 
accounted for in pricing. For segments that are not being priced 
according to actual risk, one could easily imagine negative 
values of q (i.e., the higher the premium, the lower the 
frequency), and values of q that are greater than one (i.e., the 
lower the premium, the higher the severity) . 

Nevertheless, it is reasonable to assume that q will 
typically lie within some interval a<q<b, where a<0<b. 
Approximations for g(q) and h(q) can then be constructed that are 
highly accurate within this interval, but are less accurate when 
q falls outside the interval. The upper and lower bounds of the 
interval can be determined experimentally by performing data 
mining runs assuming default values for the bounds. If the value 
of q for any of the resulting segments is found to lie outside
the default interval, then the upper and lower bounds of the
interval can be adjusted and data mining re-executed. This
process can be repeated as many times as necessary to obtain 
appropriate bounds for q. Note, however, that the bounds should 
be made as tight as possible in order to maximize the accuracy of 
the approximating functions for g(q) and h(q). 

For values of q in the interval a<q<b, where a<0<b,
accurate polynomial approximations of g(q) and h(q) can be
constructed using Chebyshev interpolation. With this approach,
the true value of g(q) would be calculated for (m+1) discrete
values of q, labeled $q_0, \ldots, q_m$, where $q_i$ is given by

(43)

Note that $q_0 = 0$ and $q_m = b$. Similarly, the true value of h(q) would
be calculated for (n+1) discrete values of q, labeled $q_{-n}, \ldots, q_0$,
where $q_i$ is given by

(44)

Note that $q_{-n} = a$ and that Equations (43) and (44) agree on the
value of $q_0$ (i.e., $q_0 = 0$). The polynomial approximations of g(q)
and h(q) for a<q<b are then given by

(45)

and

(46)



The above equations have a number of important properties.
The first is that the values of $q_i$ defined by Equations (43)
and (44) correspond to the roots of Chebyshev polynomials of
orders (m+1) and (n+1), respectively. For $m, n \le 20$, the
approximation errors of Equations (45) and (46) are no worse than
four times the approximation errors of optimal polynomials of
degrees m and n (see, for example, G. Dahlquist, A. Bjorck, and
N. Anderson, Numerical Methods. Englewood Cliffs, New Jersey:
Prentice-Hall, 1974). For $m, n \le 100$, the approximation errors are
no worse than five times those of optimal polynomials.
Equidistant values of $q_i$, on the other hand, could result in
extremely large approximation errors, especially at the extreme
ends of the interval a<q<b. Another property of the above
equations is that Equations (45) and (46) are very stable from a
numerical analysis standpoint even for extremely high-order
polynomials. "Simplifying" these equations by expanding them into
standard polynomial forms could potentially lead to numerically
unstable calculations. Thus, the approximating functions defined
by Equations (43)-(46) are very robust from both mathematical and
computational standpoints.

For computational reasons, it is desirable to keep the
values of m and n as small as possible. Appropriate values can
be determined experimentally by observing the effect that
different settings of m and n have on the segmentations that are
produced. As m and n are increased, the approximating functions
$\hat{g}(q)$ and $\hat{h}(q)$ will likewise increase in accuracy. However, a point
will be reached beyond which further increases in accuracy will
not affect the resulting segmentation. The corresponding values
of m and n are therefore the most appropriate ones to use, since
further increases would yield no additional benefit.
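
The general idea of Chebyshev interpolation on an interval can be illustrated with the sketch below. It is not a transcription of Equations (43)-(46), whose exact form is not legible here. The node layout (roots of the order-(m+1) Chebyshev polynomial, stretched so the extreme roots land on the interval endpoints) and the use of the Lagrange product form are assumptions chosen to be consistent with the properties described in the text.

```python
import math

def stretched_chebyshev_nodes(lo, hi, m):
    """(m+1) Chebyshev roots of order m+1, stretched so that the first node
    equals lo and the last equals hi (an assumed node layout)."""
    scale = math.cos(math.pi / (2 * m + 2))
    mid, half = (lo + hi) / 2.0, (hi - lo) / 2.0
    return [mid - half * math.cos((2 * i + 1) * math.pi / (2 * m + 2)) / scale
            for i in range(m + 1)]

def lagrange_eval(nodes, values, q):
    """Evaluate the interpolating polynomial at q in Lagrange (product) form,
    without expanding it into standard polynomial coefficients."""
    total = 0.0
    for i, (q_i, v_i) in enumerate(zip(nodes, values)):
        weight = 1.0
        for j, q_j in enumerate(nodes):
            if j != i:
                weight *= (q - q_j) / (q_i - q_j)
        total += v_i * weight
    return total
```

In use, the true g(q) would be tabulated once at the nodes in a pass over the data, after which `lagrange_eval` can be called for any trial q without rescanning records.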

Although accurate approximations of g(q) and h(q) are desired
for a<q<b, it is still necessary to develop reasonable
approximations for the case in which q falls outside this
interval. Suitable approximations can be obtained by noting that
Equations (40) and (42) are similar in form to Equation (38);
hence, g(q) and h(q) are likewise asymptotically linear in q. Let
$p_{\max 2}$ be the second largest value of $p_i$, $1 \le i \le N$. Then Equation
(40) can be rewritten as

$$
g(q) = q\log\!\left(\frac{p_{\max 2}}{p_{\max}}\right)
     + \log\!\left(\frac{\sum_{p_i = p_{\max 2}} t_i}{\sum_{p_i = p_{\max}} t_i}\right)
     + \log\!\left(1 + \frac{\sum_{p_i < p_{\max 2}} t_i (p_i)^q}{\sum_{p_i = p_{\max 2}} t_i (p_{\max 2})^q}\right) .
$$



The third term in the above equation tends to zero exponentially
fast as q tends to infinity. The sums that appear in this term
can be approximated with a simple exponential function to obtain
the following rough approximation of g(q) for q>b:

$$
\hat{g}(q) = q\log\!\left(\frac{p_{\max 2}}{p_{\max}}\right)
     + \log\!\left(\frac{\sum_{p_i = p_{\max 2}} t_i}{\sum_{p_i = p_{\max}} t_i}\right)
     + \log\!\left(1 + \frac{\sum_{p_i < p_{\max 2}} t_i (p_i)^{q_m}}{\sum_{p_i = p_{\max 2}} t_i (p_{\max 2})^{q_m}}\; e^{\varphi (q - q_m)}\right) ,
\tag{47}
$$

where $q_m = b$ as per Equation (43), and where

$$
\varphi = \frac{\sum_{p_i < p_{\max 2}} t_i (p_i)^{q_m} \log(p_i)}{\sum_{p_i < p_{\max 2}} t_i (p_i)^{q_m}} - \log(p_{\max 2}) .
\tag{48}
$$

The value of $\varphi$ in the above equation was selected so that not
only are the values of $\hat{g}(q)$ and $g(q)$ the same for $q = q_m$, but their
first derivatives are the same as well.



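To approximate h(q) for q<a, let $p_{\min 2}$ be the second smallest
value of $p_i$, $1 \le i \le N$. Then Equation (42) can be rewritten as

$$
h(q) = q\log\!\left(\frac{p_{\min 2}}{p_{\min}}\right)
     + \log\!\left(\frac{\sum_{p_i = p_{\min 2}} t_i}{\sum_{p_i = p_{\min}} t_i}\right)
     + \log\!\left(1 + \frac{\sum_{p_i > p_{\min 2}} t_i (p_i)^q}{\sum_{p_i = p_{\min 2}} t_i (p_{\min 2})^q}\right) .
$$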



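The third term in the above equation also tends to zero
exponentially fast as q tends to negative infinity. It can likewise be
approximated using a simple exponential function to obtain the
following approximation of h(q) for q<a:

$$
\hat{h}(q) = q\log\!\left(\frac{p_{\min 2}}{p_{\min}}\right)
     + \log\!\left(\frac{\sum_{p_i = p_{\min 2}} t_i}{\sum_{p_i = p_{\min}} t_i}\right)
     + \log\!\left(1 + \frac{\sum_{p_i > p_{\min 2}} t_i (p_i)^{q_{-n}}}{\sum_{p_i = p_{\min 2}} t_i (p_{\min 2})^{q_{-n}}}\; e^{\psi (q - q_{-n})}\right) ,
\tag{49}
$$

where $q_{-n} = a$ as per Equation (44), and where

$$
\psi = \frac{\sum_{p_i > p_{\min 2}} t_i (p_i)^{q_{-n}} \log(p_i)}{\sum_{p_i > p_{\min 2}} t_i (p_i)^{q_{-n}}} - \log(p_{\min 2}) .
\tag{50}
$$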



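When combined, Equations (39)-(50) yield the following
approximation to Equation (35):

$$
\hat{\xi} = \hat{f}(q) + \frac{1}{2}\log\!\left[(1-q)^2\sigma_{\log(p)}^2 - 2(1-q)\sigma_{\log(s)\log(p)} + \sigma_{\log(s)}^2\right] ,
\tag{51}
$$

where

$$
\hat{f}(q) =
\begin{cases}
q\log\!\left(\dfrac{p_{\max}}{\bar{p}}\right) + \log\!\left(\sum_{p_i = p_{\max}} t_i\right) + \log\!\left(1 + e^{\hat{g}(q)}\right),
  & \text{for } q \ge 0 \\[6pt]
q\log\!\left(\dfrac{p_{\min}}{\bar{p}}\right) + \log\!\left(\sum_{p_i = p_{\min}} t_i\right) + \log\!\left(1 + e^{\hat{h}(q)}\right),
  & \text{for } q < 0 .
\end{cases}
\tag{52}
$$

In the above equation, $\hat{g}(q)$ is given by Equations (45) and (47)
for $0 \le q \le q_m$ and $q > q_m$, respectively; $\hat{h}(q)$ is given by
Equations (46) and (49) for $q_{-n} \le q < 0$ and $q < q_{-n}$, respectively.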



The optimum value of q is estimated by minimizing
Equation (51) with respect to q. This minimization can be readily
accomplished using standard function-minimization techniques
(see, for example, W. H. Press, S. A. Teukolsky,
W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The
Art of Scientific Computing, Second Edition. New York: Cambridge
University Press, 1992). Values for $\sigma_{\log}^2$, $\gamma$ and $\beta$ can then be
estimated using Equations (31), (29) and (24), respectively. The
value of $\alpha$ can be estimated by expressing the denominator of
Equation (28) in terms of $f(q)$ as defined in Equation (38), and
then using $\hat{f}(q)$ given above to estimate $f(q)$. The resulting
estimate of $\alpha$ is given by

$$
\hat{\alpha} = \frac{k+l}{\bar{p}^{\,q}\, e^{\hat{f}(q)}} . \tag{53}
$$

The estimated loss ratio is then $L = \alpha \cdot \beta$.

To identify splitting factors, Equation (51) is summed over
all segments to yield the overall score of the loss-ratio model
on the training data. This score is then minimized in a stepwise
fashion, where each step involves dividing a larger group of
policyholders into two or more smaller groups so as to reduce the
value of the overall score to the maximum extent possible. To
ensure actuarial credibility, only those subdivisions that
satisfy Equation (15) are considered during this top-down
process.

Note that calculation of Equation (51) requires computing
the sums defined in Equations (32)-(34) and (36), as well as
computing several sums of the forms

$$
\sum_{i \in \Phi} t_i , \qquad \sum_{i \in \Phi} t_i (p_i)^q , \qquad \text{and} \qquad \sum_{i \in \Phi} t_i (p_i)^q \log(p_i)
$$

for various subsets of data records $\Phi$ and various fixed values
of q. These sums can be computed in a single pass over the data
for each new splitting factor. Moreover, the sums obtained from
disjoint subsets of data can be efficiently combined in the
process of evaluating alternate splitting factors, in much the
same way that similar sums are combined in standard tree-based
modeling algorithms.

The validation-set technique presented earlier is also
applied for the purpose of maximizing the predictive accuracy of
the resulting loss-ratio model. However, Equation (26), rather
than Equation (51), must be used to calculate the overall model
score on the validation data.
Equation (51) was derived for the purpose of estimating model
parameters during training and it does not apply when evaluating
loss-ratio models on separate data sets with the parameters held
fixed.
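
The additive sums used during training (the sums of $t_i$, $t_i (p_i)^q$ and $t_i (p_i)^q \log(p_i)$ over a subset of records) lend themselves to a small accumulator that can be merged across disjoint subsets. The class and method names below are hypothetical, and the sketch tracks a single fixed value of q per accumulator.

```python
import math
from dataclasses import dataclass

@dataclass
class SegmentSums:
    """Additive sums for one subset of records at one fixed value of q."""
    sum_t: float = 0.0          # sum of t_i
    sum_t_pq: float = 0.0       # sum of t_i * p_i**q
    sum_t_pq_logp: float = 0.0  # sum of t_i * p_i**q * log(p_i)

    def add_record(self, t, p, q):
        """Fold one record (exposure t, premium rate p) into the sums."""
        weighted = t * p ** q
        self.sum_t += t
        self.sum_t_pq += weighted
        self.sum_t_pq_logp += weighted * math.log(p)

    def merged(self, other):
        """Combine the sums of two disjoint subsets without rescanning records."""
        return SegmentSums(self.sum_t + other.sum_t,
                           self.sum_t_pq + other.sum_t_pq,
                           self.sum_t_pq_logp + other.sum_t_pq_logp)
```

Because every field is a plain sum, candidate groupings can be scored by merging per-category accumulators rather than revisiting the underlying records.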

Equation (26) can be simplified for validation purposes by
exploiting the fact that only settled claim records in the
training data enter into the sum in the subformula of Equation
(26) that pertains to open claim records. From Equation (30),
that subformula simplifies to

$$
t_i\,\alpha\,(p_i)^q - q\log(p_i) + \log\!\left(\frac{\sigma_{\log}}{\alpha}\right) + \frac{1}{2} .
$$

Dropping the constant term and substituting the resulting formula
back into Equation (26) thus yields the following equation for
the negative log-likelihood with constant terms removed for
database records $i$ in the validation data set:

$$
\xi_i =
\begin{cases}
t_i\,\alpha\,(p_i)^q, & \text{for non-claim records} \\[4pt]
t_i\,\alpha\,(p_i)^q - q\log(p_i) + \log\!\left(\dfrac{\sigma_{\log}}{\alpha}\right),
  & \text{for open claim records} \\[4pt]
t_i\,\alpha\,(p_i)^q - q\log(p_i) + \log\!\left(\dfrac{\sigma_{\log}}{\alpha}\right)
  + \dfrac{\left(\log(s_i) - (1-q)\log(p_i) - \gamma\right)^2}{2\sigma_{\log}^2},
  & \text{for settled claim records.}
\end{cases}
\tag{54}
$$

If the validation data for a segment consists of $N$ database
records, the first $k$ of which represent settled claims and the
next $l$ represent open claims, then the sum of Equation (54) over
these records is given by

$$
\xi = \alpha\left[\sum_{i=1}^{N} t_i (p_i)^q\right]
    - q\left[\sum_{i=1}^{k+l}\log(p_i)\right]
    + (k+l)\log\!\left(\frac{\sigma_{\log}}{\alpha}\right)
    + \frac{1}{2\sigma_{\log}^2}\sum_{i=1}^{k}\left(\log(s_i) - (1-q)\log(p_i) - \gamma\right)^2 .
\tag{55}
$$

The sum of Equation (55) over all segments is used as the overall
score of the loss-ratio model on the validation data. The most
predictive segmentation is determined by minimizing this overall
score.

Although the loss-ratio model presented above assumes a
Poisson/log-normal model for the claims process, it should be
pointed out that the same modeling methodology can be applied to
develop loss-ratio models for other families of statistical
models employed by actuaries, such as those described by Klugman
et al. (see above).

In addition to using actuarial constraints in conjunction
with insurance risk models, the present invention can be
practiced in combination with statistical constraints developed
for use with other kinds of statistical models, such as weighted
least-squares models of the kind found in prior art regression
tree methods such as CART (see L. Breiman et al. above) and
SPSS's implementation of CHAID (see, for example,
http://www.SPSS.com). As previously discussed, weighted
least-squares techniques can be used to develop models for
predicting pure premium and loss ratio by making use of the fact
that the pure premium and loss ratio of a risk group can be
expressed as weighted averages of the pure premiums and loss 
ratios, respectively, of the individual data records that belong 
to the risk group. Even though least-squares models are not 
well-suited to the statistical characteristics of insurance data, 
such models have the benefit of being extremely simple and, 
therefore, widely applicable. It is quite reasonable from a 
modeling standpoint to use such models for exploratory purposes 
before investing time and effort in developing more elaborate 
models that are tailored to the specific statistical 
characteristics of the data. However, prior art regression tree 
methods are still deficient in that they do not take actuarial 
credibility into account. The present invention, on the other 
hand, enables weighted least-squares models to be combined with 
actuarial credibility constraints, thereby yielding a more 
suitable modeling technique from an actuarial point of view. 

The individual pure premium of database record $i$ is defined
to be

$$
\text{Pure Premium}_i =
\begin{cases}
0, & \text{for non-claim records} \\[4pt]
\dfrac{s_i}{t_i}, & \text{for settled claim records,}
\end{cases}
$$

where $s_i$ is the claim amount associated with the $i$'th record and $t_i$
is the earned exposure of the record. Similarly, the individual
loss ratio of database record $i$ is defined to be

$$
\text{Loss Ratio}_i =
\begin{cases}
0, & \text{for non-claim records} \\[4pt]
\dfrac{s_i}{p_i \cdot t_i}, & \text{for settled claim records,}
\end{cases}
$$

where $s_i$ and $t_i$ are defined as above, and where $p_i$ is the premium
charged per unit time. The product $p_i \cdot t_i$ is thus the earned
premium of record $i$. Individual pure premiums and loss ratios are
undefined for open claim records.

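If there were no open claim records, the estimated pure
premium for a group of policyholders would be given by

$$
\text{Pure Premium} = \frac{\sum_{i=1}^{k} s_i}{\sum_{i=1}^{N} t_i}
  = \frac{\sum_{i=1}^{N} t_i \cdot \text{Pure Premium}_i}{\sum_{i=1}^{N} t_i} ,
$$

where $N$ is the number of database records for the group, the
first $k$ of which are assumed for convenience to be claim records.
Likewise, the estimated loss ratio for a group of policyholders
would be given by

$$
\text{Loss Ratio} = \frac{\sum_{i=1}^{k} s_i}{\sum_{i=1}^{N} p_i \cdot t_i}
  = \frac{\sum_{i=1}^{N} p_i \cdot t_i \cdot \text{Loss Ratio}_i}{\sum_{i=1}^{N} p_i \cdot t_i} .
$$

Note that the above equations have the general form

$$
\bar{X} = \frac{\sum_{i=1}^{k} s_i}{\sum_{i=1}^{N} w_i}
        = \frac{\sum_{i=1}^{N} w_i \cdot X_i}{\sum_{i=1}^{N} w_i} ,
\tag{56}
$$

where

$$
X_i =
\begin{cases}
0, & \text{for non-claim records} \\[4pt]
\dfrac{s_i}{w_i}, & \text{for settled claim records,}
\end{cases}
\tag{57}
$$

and where $w_i = t_i$ in the case of pure premium, and $w_i = p_i \cdot t_i$ in the
case of loss ratio. Thus, a single statistical model can be
developed based on Equations (56) and (57) that can then be
specialized for pure premium or loss ratio modeling by supplying
appropriate values for the weights $w_i$.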



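Let us assume for the moment that Equation (56) is obtained
by minimizing an optimization criterion $\xi$. Equation (56) would
then be obtained by differentiating $\xi$ with respect to the
weighted average $\bar{X}$ and setting the result to zero. Rearranging
Equation (56) leads to the following differential equation for
the value of $\bar{X}$ that optimizes $\xi$:

$$
\frac{\partial \xi}{\partial \bar{X}} = C_1\left[\bar{X}\left(\sum_{i=1}^{N} w_i\right) - \sum_{i=1}^{N} w_i \cdot X_i\right] = 0 ,
$$

where $C_1$ can be any nonzero term that is not a function of the
weighted average $\bar{X}$. The optimization criterion $\xi$ can be recovered
by integrating the above equation with respect to $\bar{X}$:

$$
\begin{aligned}
\xi &= C_1\left[\frac{1}{2}\bar{X}^2\left(\sum_{i=1}^{N} w_i\right) - \bar{X}\left(\sum_{i=1}^{N} w_i \cdot X_i\right)\right] + C_2 \\
    &= \frac{C_1}{2}\left[\sum_{i=1}^{N} w_i \cdot (X_i - \bar{X})^2\right] + C_3 ,
\end{aligned}
$$

where $C_2$ and $C_3$ are constants of integration, $C_3 = C_2 - \frac{C_1}{2}\sum_{i=1}^{N} w_i \cdot X_i^2$.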



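From a maximum likelihood standpoint, the optimization
criterion derived above can be viewed as the negative log
likelihood of the data values $X_i$ under the assumption that the
$X_i$'s follow a weighted Gaussian distribution $f(X_i)$ given by

$$
f(X_i) = \sqrt{\frac{w_i}{2\pi\sigma^2}}\; e^{-\frac{w_i \cdot (X_i - \bar{X})^2}{2\sigma^2}} . \tag{58}
$$

The variance of each data value $X_i$ is thus

$$
\mathrm{Var}[X_i] = \frac{\sigma^2}{w_i} , \tag{59}
$$

and the negative log likelihood of $X_i$ is

$$
-\log f(X_i) = \frac{1}{2}\log\!\left(\frac{2\pi\sigma^2}{w_i}\right) + \frac{w_i \cdot (X_i - \bar{X})^2}{2\sigma^2} .
$$

Summing the above equation and setting the result equal to $\xi$
reveals that

$$
C_1 = \frac{1}{\sigma^2} \quad \text{and} \quad C_3 = \frac{1}{2}\sum_{i=1}^{N}\log\!\left(\frac{2\pi\sigma^2}{w_i}\right) ,
$$

and that

$$
\xi = -\sum_{i=1}^{N}\log f(X_i)
    = \frac{1}{2\sigma^2}\sum_{i=1}^{N} w_i \cdot (X_i - \bar{X})^2
    + \frac{1}{2}\sum_{i=1}^{N}\log\!\left(\frac{2\pi\sigma^2}{w_i}\right) .
\tag{60}
$$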



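The above equation for the negative log likelihood criterion
$\xi$ can be generalized to take open claim records into account by
noticing that each data record effectively contributes an amount

$$
w_i \cdot (X_i - \bar{X})^2
$$

to the sum in the first term of Equation (60). For non-claim records, the
above formula reduces to

$$
w_i \cdot \bar{X}^2 .
$$

In the case of settled claim records, the formula expands to

$$
w_i \cdot \bar{X}^2 + s_i\left(\frac{s_i}{w_i} - 2\bar{X}\right) .
$$

For open claims, the value of $s_i$ is unknown. However, the above
formula for settled claim records can be used for open claim
records by approximating the second term of the formula by its
average value with respect to the settled claim records:

$$
w_i \cdot \bar{X}^2 + \frac{1}{k}\sum_{j=1}^{k} s_j\left(\frac{s_j}{w_j} - 2\bar{X}\right) .
$$

The three formulas above enable the negative log likelihood of
record $i$ to be defined as follows:

$$
\xi_i =
\begin{cases}
\dfrac{1}{2\sigma^2}\, w_i \cdot \bar{X}^2 + \dfrac{1}{2}\log\!\left(\dfrac{2\pi\sigma^2}{w_i}\right),
  & \text{for non-claim records} \\[6pt]
\dfrac{1}{2\sigma^2}\left[w_i \cdot \bar{X}^2 + \dfrac{1}{k}\sum_{j=1}^{k} s_j\left(\dfrac{s_j}{w_j} - 2\bar{X}\right)\right] + \dfrac{1}{2}\log\!\left(\dfrac{2\pi\sigma^2}{w_i}\right),
  & \text{for open claim records} \\[6pt]
\dfrac{1}{2\sigma^2}\left[w_i \cdot \bar{X}^2 + s_i\left(\dfrac{s_i}{w_i} - 2\bar{X}\right)\right] + \dfrac{1}{2}\log\!\left(\dfrac{2\pi\sigma^2}{w_i}\right),
  & \text{for settled claim records.}
\end{cases}
\tag{61}
$$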



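If the data for a risk group consists of $N$ database records, the
first $k$ of which represent settled claims and the next $l$
represent open claims, then the negative log likelihood of the
data is given by the sum of Equation (61) for those data records:

$$
\begin{aligned}
\xi &= \frac{1}{2\sigma^2}\left[\bar{X}^2\sum_{i=1}^{N} w_i + \sum_{i=1}^{k} s_i\left(\frac{s_i}{w_i} - 2\bar{X}\right) + \frac{l}{k}\sum_{i=1}^{k} s_i\left(\frac{s_i}{w_i} - 2\bar{X}\right)\right] + \frac{1}{2}\sum_{i=1}^{N}\log\!\left(\frac{2\pi\sigma^2}{w_i}\right) \\
    &= \frac{1}{2\sigma^2}\left[\bar{X}^2\sum_{i=1}^{N} w_i + \frac{k+l}{k}\sum_{i=1}^{k} s_i\left(\frac{s_i}{w_i} - 2\bar{X}\right)\right] + \frac{1}{2}\sum_{i=1}^{N}\log\!\left(\frac{2\pi\sigma^2}{w_i}\right) .
\end{aligned}
\tag{62}
$$

Note that Equation (62) reduces to Equation (60) in the case
where there are no open claim records (i.e., $l = 0$).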

The maximum likelihood estimate of the weighted average $\bar{X}$
for the group is determined by setting the corresponding
derivative of the above equation to zero. The derivative of $\xi$
with respect to $\bar{X}$ is

$$
\frac{\partial \xi}{\partial \bar{X}} = \frac{1}{\sigma^2}\left[\bar{X}\sum_{i=1}^{N} w_i - \frac{k+l}{k}\sum_{i=1}^{k} s_i\right] ,
$$

which, when set to zero, yields the following maximum likelihood
estimate of $\bar{X}$:

$$
\bar{X} = \frac{\frac{k+l}{k}\sum_{i=1}^{k} s_i}{\sum_{i=1}^{N} w_i} . \tag{63}
$$

Comparing Equation (63) to Equation (56) reveals that, when open
claim records are present, the sum of the settled claim amounts
must be scaled proportionately in order to estimate the sum of
all claim amounts when calculating the weighted average $\bar{X}$.
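
The scaling of settled claim amounts to account for open claims can be sketched as follows. The function name and argument layout are assumptions of this example, and raising an error when there are open claims but no settled ones is a choice made here, not something the text specifies.

```python
def weighted_average(settled_amounts, n_open, weights):
    """Equation (63): X_bar = ((k + l) / k) * sum(s_i) / sum(w_i).

    settled_amounts: claim amounts s_i of the k settled claim records.
    n_open: number l of open claim records (their amounts are unknown).
    weights: w_i for all N records (t_i for pure premium,
             p_i * t_i for loss ratio).
    """
    k = len(settled_amounts)
    if k == 0:
        if n_open > 0:
            # Assumption of this sketch: no settled claims means no basis
            # for scaling, so the estimate is treated as undefined.
            raise ValueError("open claims present but no settled claims")
        return 0.0
    return (k + n_open) / k * sum(settled_amounts) / sum(weights)
```

For example, with settled amounts of 1000 and 3000, two open claims, and ten records of unit weight, the settled total of 4000 is doubled before dividing by the total weight, giving 800.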



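Actuarial credibility can be enforced by placing a limit on
the fractional standard error allowed in the above estimate of
the weighted average $\bar{X}$. The variance of $\bar{X}$ can be calculated by
first noticing that Equation (63), as obtained from
Equation (62), is actually a simplified version of the equation

$$
\bar{X} = \frac{k+l}{k}\cdot\frac{\sum_{i=1}^{k} w_i \cdot X_i + \sum_{i=k+l+1}^{N} w_i \cdot X_i}{\sum_{i=1}^{N} w_i} ,
$$

where $X_i = 0$ for $i > k+l$, and where the numerator omits $X_i$'s for open
claim records, whose indices lie in the range $k+1 \le i \le k+l$ as per
Equation (62). The variance of the weighted average $\bar{X}$ is
therefore given by

$$
\mathrm{Var}[\bar{X}] = \left(\frac{\frac{k+l}{k}}{\sum_{i=1}^{N} w_i}\right)^2
  \left(\sum_{i=1}^{k} w_i^2 \cdot \mathrm{Var}[X_i] + \sum_{i=k+l+1}^{N} w_i^2 \cdot \mathrm{Var}[X_i]\right) ,
$$

which, from Equation (59), simplifies to

$$
\mathrm{Var}[\bar{X}] = \sigma^2\left(\frac{k+l}{k}\right)^2
  \frac{\sum_{i=1}^{k} w_i + \sum_{i=k+l+1}^{N} w_i}{\left(\sum_{i=1}^{N} w_i\right)^2} .
$$

The fractional standard error of the weighted average $\bar{X}$ is
therefore given by

$$
\frac{\sqrt{\mathrm{Var}[\bar{X}]}}{\bar{X}} =
  \frac{\sqrt{\sigma^2\left(\sum_{i=1}^{k} w_i + \sum_{i=k+l+1}^{N} w_i\right)}}{\sum_{i=1}^{k} s_i} .
$$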



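The above equation is expressed in terms of the unweighted
variance $\sigma^2$. The maximum likelihood estimate of this unweighted
variance can be determined by setting the corresponding
derivative of Equation (62) to zero. The derivative of $\xi$ with
respect to $\sigma$ is

$$
\frac{\partial \xi}{\partial \sigma} = \frac{N}{\sigma}
  - \frac{1}{\sigma^3}\left[\bar{X}^2\sum_{i=1}^{N} w_i + \frac{k+l}{k}\sum_{i=1}^{k} s_i\left(\frac{s_i}{w_i} - 2\bar{X}\right)\right] ,
$$

which, when set to zero, yields the following maximum likelihood
estimate of $\sigma^2$:

$$
\hat{\sigma}^2 = \frac{1}{N}\left[\bar{X}^2\sum_{i=1}^{N} w_i + \frac{k+l}{k}\sum_{i=1}^{k} s_i\left(\frac{s_i}{w_i} - 2\bar{X}\right)\right] .
$$

The fractional standard error of $\bar{X}$ therefore reduces to

$$
\frac{\sqrt{\mathrm{Var}[\bar{X}]}}{\bar{X}} =
  \frac{\hat{\sigma}\sqrt{\sum_{i=1}^{k} w_i + \sum_{i=k+l+1}^{N} w_i}}{\sum_{i=1}^{k} s_i} .
$$

Using the above equation, a limit $r'$ can be placed on the
maximum allowed fractional standard error of the weighted average
$\bar{X}$ in order to constrain the splitting factors that are
identified during model construction:

$$
\frac{\sqrt{\mathrm{Var}[\bar{X}]}}{\bar{X}} =
  \frac{\hat{\sigma}\sqrt{\sum_{i=1}^{k} w_i + \sum_{i=k+l+1}^{N} w_i}}{\sum_{i=1}^{k} s_i}
  \le r' .
\tag{64}
$$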



This limit would thus serve the same role as the actuarial 
credibility constraint for joint Poisson/log-normal models 
presented in Equation (15) . 
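
A credibility check of this kind might be sketched as below, using the estimate of the weighted average from Equation (63) and the maximum likelihood estimate of the unweighted variance obtained from Equation (62). The function names and the convention of passing separate weight lists for settled, open and non-claim records are assumptions of this example, and at least one settled claim is required.

```python
import math

def fractional_standard_error(settled_amounts, settled_weights,
                              open_weights, nonclaim_weights):
    """Fractional standard error of the weighted average for one group."""
    k, l = len(settled_amounts), len(open_weights)
    n = k + l + len(nonclaim_weights)
    total_w = sum(settled_weights) + sum(open_weights) + sum(nonclaim_weights)
    scale = (k + l) / k
    x_bar = scale * sum(settled_amounts) / total_w                 # Eq. (63)
    spread = sum(s * (s / w - 2 * x_bar)
                 for s, w in zip(settled_amounts, settled_weights))
    sigma2 = (x_bar ** 2 * total_w + scale * spread) / n           # ML estimate
    observed_w = sum(settled_weights) + sum(nonclaim_weights)      # open claims omitted
    return math.sqrt(sigma2 * observed_w) / sum(settled_amounts)

def is_credible(fse, limit):
    """Constraint of Equation (64): accept a group only if fse <= limit."""
    return fse <= limit
```

A candidate split would be considered only if every resulting group passes `is_credible` for the chosen limit.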



It is possible to use the minimum value of Equation (62) as
the numerical criterion for selecting splitting factors. However,
Equation (62) assumes that the X_i's follow the weighted Gaussian
distribution defined by Equation (58) and this assumption is
certainly incorrect for the pure premiums and loss ratios of
individual data records defined by Equation (57). A more robust
criterion would therefore be desirable. Equation (62) can be made
more robust by assuming, for the purpose of selecting splitting
factors, that the unweighted variance σ² that appears in
Equations (61) and (62) is constant across all segments that are
constructed. Although this assumption is also likely to be
invalid, the assumption has the desirable effect of causing
segments to be merged based on similarities in their weighted
averages X. Without the assumption, the preferred merging
process may avoid merging segments based on differences in the
unweighted variances σ² of the segments even though the weighted
averages X of the segments may be close in value or even
identical.

The above assumption allows Equations (61) and (62) to be
rescaled by a constant factor of 2σ². In addition, the log(2πσ²/w_i)
terms that appear in Equations (61) and (62) become constants
when summed over all data records. These terms can therefore be
dropped for optimization purposes. Equation (61) can therefore
be simplified to yield the following rescaled negative log
likelihood of record i with constant terms removed:



[expression illegible in source]   for non-claim records

[expression illegible in source]   for open claim records        (65)

[expression illegible in source]   for settled claim records.
Summing Equation (65) over all data records for a risk group
yields the following rescaled negative log likelihood with
constant terms removed:

[Equation illegible in source.]   (66)

The minimum value of Equation (66) is the preferred numerical 
criterion for selecting splitting factors when weighted 
least-squares models are used for insurance risk modeling
purposes . 
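
To make the idea of a rescaled criterion concrete, the following Python sketch computes a weighted sum of squared deviations for each candidate segment and adds the results across segments. It assumes the usual weighted Gaussian form in which, after rescaling by 2σ² and dropping the logarithmic terms, each record contributes w_i (X_i - μ)², with μ taken as the weighted mean of the segment. It also assumes there are no open claim records, so it is not a transcription of Equation (66). The function names are hypothetical.

```python
def weighted_sse(values, weights):
    """Weighted sum of squared deviations about the weighted mean.

    Assumes positive weights and no open claim records (an assumption made
    for illustration, not a statement of the source's Equation (66)).
    """
    total_w = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total_w
    return sum(w * (x - mean) ** 2 for x, w in zip(values, weights))


def split_criterion(segments):
    """Sum the per-segment criterion over (values, weights) pairs; lower is better."""
    return sum(weighted_sse(values, weights) for values, weights in segments)
```

For example, a split that separates records with very different pure premiums lowers the criterion relative to keeping them in a single segment, which is the behavior a splitting criterion of this kind is meant to reward.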

The preferred method steps of the overall invention are now 
disclosed. A preferred embodiment of the present invention 
includes features implemented as software tangibly embodied on a 
computer program product or program storage device for execution 
on a processor. For example, software implemented in a popular 
object-oriented computer executable code such as JAVA provides 
portability across different platforms. Those skilled in the art 
will appreciate that other procedure-oriented and object-oriented 
programming (OOP) environments, such as C++ and Smalltalk, can 
also be employed. Those skilled in the art will also appreciate 
that the methods of the present invention may be implemented as 
software for execution on a computer or other processor-based 
device. The software may be embodied on a magnetic, electrical, 
optical, or other persistent program and/or data storage device, 
including but not limited to: magnetic disks, DASD, bubble 
memory, tape, optical disks such as CD-ROM's and DVD's, and other 
persistent (also called nonvolatile) storage devices such as 
core, ROM, PROM, flash memory, or battery backed RAM. Those 
skilled in the art will also appreciate that within the spirit
and scope of the present invention, one or more of the components 
could be accessed and maintained directly via disk, a network, a 
server, or could be distributed across a plurality of servers. 

Step 1 preferably comprises constructing an initial
plurality of population segments and associated segment models.
This initial plurality constitutes the initial value of what we
will refer to as the "current plurality." Unless prior knowledge
about the application domain suggests otherwise, the initial
plurality should preferably comprise a single segment model
associated with the overall population of training data.

Step 2 preferably comprises selecting a population segment
and its associated segment model from the current plurality,
excluding those segments and segment models that were selected in
previous applications of step 2.

Step 3 preferably comprises replacing the segment and
segment model selected in step 2 with two or more smaller
segments and associated segment models preferably constructed
from the selected segment according to the following method:

a) For each explanatory data field (i.e., each data field whose
values are allowed to be used to distinguish one
population segment from another), the selected segment
should preferably be divided into at least two smaller,
mutually exclusive segments based on the possible values
of that explanatory data field. If subdivision is not
possible, go to step 4.

In the case of a categorical data field, each of the 
smaller segments should preferably correspond to one of
the category values admitted under the definition of the 
selected segment. If the data field is not mentioned in 
the definition of the selected segment, then a smaller 
segment should preferably be constructed for each 
possible category value for that data field. If the 
definition of the selected segment restricts the value of 
the data field to a subset of category values, then 
smaller segments should preferably be constructed only 
for category values in that subset. In both cases, it is 
possible that some category values may correspond to 
missing values for the data field. 

In the case of a numerical data field, the possible 
values of the data field should preferably be discretized 
into ordinal classes as described by Biggs et al. (see 
above) and segments should preferably be constructed for 
each of the resulting ordinal classes. Segments should 
also preferably be constructed for additional "floating" 
categories (see Kass above) that correspond to missing 
values for the data field. 

In all cases, segment models should preferably be
constructed for the constructed segments.

b) For each explanatory data field, those segments
constructed for the explanatory data field in step 3a
that admit missing values for the explanatory field
should preferably be set aside and the following merge
steps should preferably be performed on the remaining
segments constructed for the explanatory field:

i) For nominal explanatory fields, all remaining
segments that have at least one training record
species count that lies below the threshold for
that species should preferably be merged
together and a segment model should preferably
be constructed for the newly merged segment.
For insurance risk or profitability modeling
purposes, a threshold of six fully settled
claim records should preferably be used.

ii) For ordinal explanatory fields, if all
remaining segments have at least one training
record species count that lies below the
corresponding threshold referred to in
step 3b(i), then all remaining segments should
preferably be merged together and a segment
model should preferably be constructed for the
newly merged segment. Otherwise, pairs of
remaining segments that satisfy the following
conditions should preferably be repeatedly
selected and merged, and segment models should
preferably be constructed for the newly merged
segments, until the conditions can no longer be
satisfied or until a single segment is
obtained:

A) The values of the explanatory field that
are admitted by the two segments to be
merged should preferably be adjacent with
respect to the ordering of the values for
that ordinal explanatory field.

B) At least one training record species
count for one of the segments to be
merged should preferably lie below the
corresponding threshold referred to in
step 3b(i), while all training record
species counts for the other segment in
the pair should preferably lie above the
corresponding thresholds.

c) For each explanatory data field, those segments
constructed for the explanatory field in step 3a that
admit missing values for the explanatory field should
preferably be set aside. If two or more segments remain
from among those constructed for the explanatory field,
then pairs of these remaining segments should preferably
be repeatedly selected and merged, and segment models
should preferably be constructed for the newly merged
segments, so as to optimize the desired numerical
criteria for selecting splitting factors subject to the
following conditions:

i) If at least one of the remaining segments does
not satisfy the desired statistical constraints
for segments, then at least one of the segments
in the pair being merged should preferably not
satisfy the statistical constraints either.

For insurance risk or profitability modeling
purposes using joint Poisson/log-normal models,
the preferred statistical constraint for
segments is given by Equation (15).

For insurance risk or profitability modeling
purposes using weighted least-squares models,
the preferred statistical constraint for
segments is given by Equation (64).



ii) In the case of ordinal data fields, the values
of the explanatory field that are admitted by
the two segments being merged should preferably
be adjacent with respect to the ordering of the
values for that ordinal explanatory field.

For insurance risk modeling purposes using joint
Poisson/log-normal models, the preferred criterion for
selecting splitting factors is to minimize the sum of
Equation (12) for the resulting segments constructed for
the explanatory field.

For policyholder profitability modeling purposes using
joint Poisson/log-normal models, the preferred criterion
for selecting splitting factors is to minimize the sum of
Equation (51) for the resulting segments constructed for
the explanatory field.

For insurance risk or profitability modeling purposes
using weighted least-squares models, the preferred
criterion for selecting splitting factors is to minimize
the sum of Equation (66) for the resulting segments
constructed for the explanatory field.
The merging process described above should preferably be
continued until only two segments remain (i.e., not
including the segments that were preferably set aside
that admit missing values for the field). If one of these
two remaining segments does not satisfy the desired
statistical constraints for segments, then the two
remaining segments should preferably be merged into a
single segment.

d) For each explanatory data field, those segments 
constructed for the explanatory field in step 3a that 
admit missing values for the explanatory field should 
preferably be set aside. If a single segment remains from 
among those constructed for the explanatory field, then 
the explanatory data field should preferably be 
eliminated from further consideration when at least one
of the following conditions holds:

i) the single remaining segment does not satisfy
the desired statistical constraints for
segments;

or

ii) the single remaining segment does indeed 
satisfy the desired statistical constraints for 
segments, but no segments were set aside that 
admit missing values for the explanatory field. 

e) If all explanatory data fields were eliminated from 
further consideration in step 3d, then the segment 
selected in step 2 cannot be divided into smaller
segments that satisfy the desired statistical constraints
for segments. Subdivision cannot be performed; therefore,
go to step 4.

f) Otherwise, for each explanatory data field that was not 
eliminated from consideration in step 3d, and for each 
segment that was constructed for the explanatory field 
and that was set aside in step 3c because it admits 
missing values, if this missing-value segment does not 
satisfy the desired statistical constraints for segments, 
then the segment model of the segment selected in step 2 
should preferably be used as the segment model of the 
missing-value segment. 

g) For each explanatory data field that was not eliminated 
in step 3d, evaluate the segments and associated segment 
models that were constructed for that data field, 
preferably using the desired numerical criteria for 
selecting splitting factors, and select the explanatory 
data field that preferably optimizes these criteria. 

h) Remove the segment and its associated segment model that 
were selected in step 2 from the current plurality and 
replace them preferably with the segments and associated
segment models that were constructed for the explanatory
data field selected in step 3g. Place the segment and
associated segment model that were removed in a buffer
and establish linkages between the segment and associated
segment model that were removed and their replacements.
The segment and associated segment model that were
removed are said to be the "parents" of the replacements.
Similarly, the replacements are said to be the "children" 
of their parents. 
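
The core of step 3 for nominal fields can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented procedure in full: records are hypothetical dictionaries with a value "x", a weight "w", and categorical fields; a weighted sum of squared deviations stands in for the numerical criterion; the caller supplies a hypothetical predicate satisfies for the statistical constraint; and the count-based merging of step 3b, the ordinal adjacency condition, and the model substitution of step 3f are omitted.

```python
from collections import defaultdict
from itertools import combinations


def wsse(records):
    """Weighted sum of squared deviations about the weighted mean (stand-in criterion)."""
    total_w = sum(r["w"] for r in records)
    mean = sum(r["w"] * r["x"] for r in records) / total_w
    return sum(r["w"] * (r["x"] - mean) ** 2 for r in records)


def split_on_field(records, field, satisfies):
    """Candidate segments for one nominal field, or None if the field is eliminated."""
    groups = defaultdict(list)
    for r in records:                       # step 3a: one segment per category value
        groups[r.get(field)].append(r)
    missing = groups.pop(None, None)        # set aside the missing-value segment
    segs = list(groups.values())
    while len(segs) > 2:                    # step 3c: merge pairs until two remain
        pairs = list(combinations(range(len(segs)), 2))
        if not all(satisfies(s) for s in segs):
            # At least one member of the merged pair must violate the constraint.
            pairs = [(i, j) for i, j in pairs
                     if not (satisfies(segs[i]) and satisfies(segs[j]))]

        def increase(pair):
            i, j = pair
            return wsse(segs[i] + segs[j]) - wsse(segs[i]) - wsse(segs[j])

        i, j = min(pairs, key=increase)     # merge that raises the criterion least
        segs[i] = segs[i] + segs[j]
        del segs[j]
    if len(segs) == 2 and not all(satisfies(s) for s in segs):
        segs = [segs[0] + segs[1]]
    if not segs:
        return None                         # only missing values (case not covered in the text)
    if len(segs) == 1 and (missing is None or not satisfies(segs[0])):
        return None                         # step 3d: eliminate the field
    return segs + ([missing] if missing is not None else [])


def best_split(records, fields, satisfies):
    """Step 3g: (score, field, segments) with the lowest score, or None (step 3e)."""
    best = None
    for field in fields:
        segs = split_on_field(records, field, satisfies)
        if segs is None:
            continue
        score = sum(wsse(s) for s in segs)
        if best is None or score < best[0]:
            best = (score, field, segs)
    return best
```

Choosing the pair whose merge increases the criterion least is equivalent to minimizing the total over the resulting segments, because the segments not involved in the merge are unchanged.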



Step 4 preferably comprises repeating steps 2 and 3 until
step 2 can no longer be applied.

Step 5 preferably comprises moving the segments and 
associated segment models in the current plurality into the 
buffer referred to in step 3h. 
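
Steps 1 through 5 amount to a worklist loop over segments, which the following Python sketch illustrates. The class SegmentNode, the function grow_tree, and the callback split are hypothetical names introduced here; split is assumed to return two or more lists of records when a valid subdivision exists and None otherwise, and the order in which pending segments are selected is an arbitrary choice.

```python
class SegmentNode:
    """A segment together with its parent/child linkages (step 3h)."""

    def __init__(self, records, parent=None):
        self.records = records
        self.parent = parent
        self.children = []


def grow_tree(records, split):
    """Return (root, buffer) after repeatedly subdividing segments."""
    root = SegmentNode(records)        # step 1: a single initial segment
    current = [root]                   # the "current plurality"
    pending = [root]                   # segments not yet selected in step 2
    buffer = []                        # replaced parents
    while pending:                     # step 4: repeat until step 2 cannot be applied
        node = pending.pop()           # step 2: select a segment not selected before
        parts = split(node.records)    # step 3: try to subdivide it
        if not parts:
            continue                   # no valid subdivision; the segment stays a leaf
        node.children = [SegmentNode(p, parent=node) for p in parts]
        current = [n for n in current if n is not node] + node.children
        pending.extend(node.children)
        buffer.append(node)            # step 3h: keep the parent in the buffer
    buffer.extend(current)             # step 5: move the current plurality into the buffer
    return root, buffer
```

After the loop, the buffer holds every node of the tree, and the nodes without children are the segments that were in the current plurality when growth stopped.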

Step 6 preferably comprises evaluating the segments and 
associated segment models present in the buffer on a portion of 
the training data held aside for validation purposes, and 
assigning a score to each segment and associated segment model 
based on the evaluation, wherein lower scores indicate better 
models. The score should preferably correspond to the numerical 
criteria for selecting splitting factors. For insurance risk
modeling purposes using joint Poisson/log-normal models, the
score for each segment and associated segment model should
preferably be given by Equation (12). For policyholder
profitability modeling purposes using joint Poisson/log-normal
models, the score for each segment and associated segment model
should preferably be given by Equation (55). For insurance risk
or profitability modeling purposes using weighted least-squares
models, the score for each segment and associated segment model
should preferably be given by Equation (66).

Step 7 preferably comprises applying Quinlan's reduced error
pruning method (see J. R. Quinlan, 1987, above) to the tree of
segments and associated segment models present in the buffer,
wherein the scores assigned in step 6 are used instead of the
number of errors (i.e., misclassifications) on the test set
discussed by Quinlan.

Step 8 preferably comprises moving the leaves of the pruned 
tree produced in step 7 from the buffer back into the current 
plurality. 
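
Steps 6 through 8 can be illustrated with a short Python sketch of a common bottom-up formulation of reduced error pruning, with validation scores in place of error counts. The sketch assumes that each node already carries the score it would receive as a leaf on the validation data, that scores are additive across the children of a node, and that ties are resolved in favor of the smaller tree. The class Node and the functions prune and leaves are hypothetical names.

```python
class Node:
    """A segment with its validation score (lower is better) and child segments."""

    def __init__(self, score, children=()):
        self.score = score
        self.children = list(children)


def prune(node):
    """Prune the subtree in place and return its resulting validation score."""
    if not node.children:
        return node.score
    subtree_score = sum(prune(child) for child in node.children)
    if node.score <= subtree_score:
        node.children = []             # the node alone scores no worse: make it a leaf
        return node.score
    return subtree_score


def leaves(node):
    """Leaves of the pruned tree, i.e., the segments returned to the current plurality."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]
```

Calling prune on the root and then leaves on the root yields the final collection of segments under these assumptions.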

Step 9 preferably comprises outputting a specification of 
the plurality of segments and associated models, preferably to a 
storage device readable by a machine, thereby enabling the 
plurality to be readily applied to generate predictions. 
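
As one possible illustration of step 9, the following Python sketch serializes a list of segment definitions and model parameters to JSON text, which could then be written to any storage device. The layout used here (a "conditions" mapping and a "model" mapping per segment) and the function names are hypothetical; the source does not prescribe a format.

```python
import json


def specification_to_json(segments):
    """Serialize segment definitions and model parameters to JSON text."""
    return json.dumps({"segments": segments}, indent=2, sort_keys=True)


def specification_from_json(text):
    """Recover the list of segments from JSON text produced above."""
    return json.loads(text)["segments"]
```

A consumer that generates predictions would read the specification back, find the segment whose conditions a record satisfies, and apply that segment's model.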